Explanation

In my log I have many lines like this:

"[14/Oct/2014:13:02:15 +0200]","70","-","192.168.1.1","/API-1.2/testeo_keyword/vcn,ge/channel,rateber/site,bla_.de/keyword,null/px2.js","?ts=0.3054514767395726", "200","+", "http://www.bla.de/Arzt/Baden-W%C3%BCrttemberg/328-Heidelberg/Neurochirurgie/","Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50527; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.2; MS-RTC LM 8)","-"0/hurlau,superman;tile,4;status,0/pxl.js","?ts=0.3001205851715877", "200","+", "http://www.super.de/news/audio-video/carl-zeiss-praesentiert-3d-brille-100-euro-742545.html","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0","-"

What to capture?

From the n - 2nd field (the third from last, where n is the last field; it is the one with the URL) I need to capture the domain name, and for every line where the domain name is super.de I need to collect the whole URL.

What do I have?

I have this RegEx: http://regexr.com/39q1b, with which I managed to capture everything I need, but is the way I am doing it, with nested groups ((match)match), correct? Later, wherever the domain name is super.de, I need to collect the whole URL. The www is optional. Note: the first URL occurrence (www.bla.de) needs to be ignored.

2 Answers

I think that complex regex can be simplified, given your requirement to capture the URL for every line where the domain name is super.de:

https?:\/\/(?:www\.)?super\.de[^"]+(?!.*?super\.de)

RegEx Demo

9 Comments

It would be great if this worked with regex101.com/r/cU3sY9/2. I also need something more efficient than a lookahead if possible, since I will need to crawl gigabytes of logs.
I have already provided you a working demo for your example, i.e. regex101.com/r/cU3sY9/3. If you want to avoid the lookahead, then I think it is more efficient to pre-process your data by splitting on commas and to apply a non-lookahead regex only to the relevant element of the resulting array.
OK, if we add two parentheses at the beginning and end, your example will work: regex101.com/r/cU3sY9/4
Another way to simplify that regex is: ^(?:[^"]*"[^"]*"[^"]*){14}"[^"]*"(https?:\/\/(?:www\.)?super\.de[^"]+) See this demo: regex101.com/r/cU3sY9/5
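
The pre-processing idea from the comments could look roughly like this in Python (the language is an assumption, suggested by the mention of re.findall in the other answer). This sketch parses each line with the standard csv module rather than a plain split on commas, because the quoted fields themselves contain commas. It then tests only the third-from-last field with an anchored, lookahead-free pattern. The function name, the pattern constant and the sample lines are made up for illustration, and the sketch assumes that fields never contain escaped double quotes.

```python
import csv
import re

# Applied with match(), so it is anchored at the start of the field; no
# lookahead. The trailing group stops hosts such as "super.de.example.com"
# from matching.
SUPER_DE = re.compile(r'https?://(?:www\.)?super\.de(?:[/:?#]|$)', re.IGNORECASE)


def super_de_url(line):
    """Return the third-from-last field if its domain is super.de, else None.

    Hypothetical helper: assumes comma-separated, double-quoted fields.
    """
    try:
        # skipinitialspace handles the ', "200"' style gaps after some commas.
        fields = next(csv.reader([line], skipinitialspace=True))
    except (csv.Error, StopIteration):
        return None
    if len(fields) < 3:
        return None
    url = fields[-3]
    return url if SUPER_DE.match(url) else None


# Sample lines made up for illustration, modelled on the log in the question.
hit = ('"[14/Oct/2014:13:02:15 +0200]","70","-","192.168.1.1",'
       '"/API-1.2/tile,4/pxl.js","?ts=0.30", "200","+", '
       '"http://www.super.de/news/audio-video/","Mozilla/5.0","-"')
miss = ('"[14/Oct/2014:13:02:15 +0200]","70","-","192.168.1.1",'
        '"/API-1.2/site,super.de/px2.js","?ts=0.31", "200","+", '
        '"http://www.bla.de/Arzt/","Mozilla/4.0","-"')

print(super_de_url(hit))   # the super.de URL
print(super_de_url(miss))  # None: super.de only appears in the 5th field
```

Because only one short field is handed to the regex, the cost per line is dominated by the CSV parse, which may or may not be faster than a single regex over gigabytes of input; it would be worth timing both on a sample of the real log.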
((?:www\.)?super\.de[^"]*)

You can try this to grab the URLs with super.de as the domain. See the demo. Use re.findall or re.search.

http://regex101.com/r/sU3fA2/9

2 Comments

Almost perfect, but this would also capture the wrong group in cases where super occurs somewhere other than the n - 2nd group, e.g. the 5th group (it happens).
See regex101.com/r/cU3sY9/2; I only need this to match the n - 2nd group, where n is the last group.
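
One way to meet the "only the n - 2nd group" requirement with a single regex is to anchor the pattern at the end of the line instead of the start, so that exactly two quoted fields must follow the URL. The Python sketch below illustrates this. The pattern, the constant name and the sample lines are assumptions made for illustration and are not part of either answer; the sketch also presumes that fields contain no embedded double quotes.

```python
import re

# The URL must be followed by exactly two more quoted fields and then the end
# of the line, which pins it to the third-from-last position.
THIRD_FROM_LAST = re.compile(
    r'"(https?://(?:www\.)?super\.de(?:[/:?#][^"]*)?)"'  # captured URL field
    r'\s*,\s*"[^"]*"'                                    # second-to-last field
    r'\s*,\s*"[^"]*"\s*$'                                # last field, then EOL
)

# Sample lines made up for illustration, modelled on the log in the question.
lines = [
    '"/pxl.js","?ts=0.30", "200","+", "http://www.super.de/news/","Mozilla/5.0","-"',
    '"/site,super.de/px2.js","?ts=0.31", "200","+", "http://www.bla.de/Arzt/","Mozilla/4.0","-"',
    '"http://super.de/early","?ts=0.32", "200","+", "http://www.bla.de/","Mozilla/4.0","-"',
]

urls = [m.group(1) for m in map(THIRD_FROM_LAST.search, lines) if m]
print(urls)  # ['http://www.super.de/news/']
```

The second and third sample lines are skipped because super.de appears there only in an earlier field, which is the case described in the comment above.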
