Look Ahead in Regex for Web Server log?

Question

Explanation

My log contains many lines like this:

"[14/Oct/2014:13:02:15 +0200]","70","-","192.168.1.1","/API-1.2/testeo_keyword/vcn,ge/channel,rateber/site,bla_.de/keyword,null/px2.js","?ts=0.3054514767395726", "200","+", "http://www.bla.de/Arzt/Baden-W%C3%BCrttemberg/328-Heidelberg/Neurochirurgie/","Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50527; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.2; MS-RTC LM 8)","-"0/hurlau,superman;tile,4;status,0/pxl.js","?ts=0.3001205851715877", "200","+", "http://www.super.de/news/audio-video/carl-zeiss-praesentiert-3d-brille-100-euro-742545.html","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0","-"

What to capture?

From the (n-2)th field (the one with the URL) I need to capture the domain name. For every line where that domain name is super.de, I need to collect the whole URL.

What do I have?

I have this regex: http://regexr.com/39q1b. It captures everything I need, but is nesting the groups like this, ((match)match), the correct way to do it? Later, wherever the domain name is "super.de", I need to collect the whole URL. The www is optional. Note: the first URL occurrence (www.bla.de) needs to be ignored.

anubhava · Accepted Answer · 2014-10-27 10:19:57Z

1

I think that complex regex can be simplified, given your requirement to capture the URL for every domain name super.de:

https?:\/\/(?:www\.)?super\.de[^"]+(?!.*?super\.de)
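As a rough illustration, the pattern can be tried in Python against a single line. The log line below is a shortened, made-up sample in the same quoted, comma-separated layout, and the dot in super.de is escaped so that it only matches a literal dot. The negative lookahead rejects any candidate that is followed by another super.de later in the line, so the last occurrence is the one that matches.

```python
import re

# Pattern from the answer; the dot in "super.de" is escaped here.
PATTERN = re.compile(r'https?://(?:www\.)?super\.de[^"]+(?!.*?super\.de)')

# Hypothetical, shortened log line (not taken from the real log).
line = (
    '"[14/Oct/2014:13:02:15 +0200]","70","-","192.168.1.1",'
    '"/API-1.2/px2.js","?ts=0.3", "200","+", '
    '"http://www.super.de/news/audio-video/article-742545.html",'
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0","-"'
)

match = PATTERN.search(line)
url = match.group(0) if match else None
print(url)
```

Note that this only picks the last super.de URL in the line; it does not by itself check which field that URL sits in.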


9 Comments

anubhava Over a year ago

@SirBenBenji Have you seen this working demo with your new example?

Cloudscape Germany Over a year ago

It would be great if this worked with regex101.com/r/cU3sY9/2. I also need something more efficient than a lookahead if possible, since I will have to crawl through gigabytes of logs.

anubhava Over a year ago

I have already provided a working demo for your example: regex101.com/r/cU3sY9/3. If you want to avoid the lookahead, I think it is more efficient to pre-process your data by splitting on commas and then applying a non-lookahead regex only to the relevant element of the resulting array.
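The split-then-match idea from this comment could look roughly like the sketch below. It assumes every field is double-quoted and contains no embedded quotes, so Python's csv module can do the splitting (a plain split on commas would break on fields such as the path, which contains commas). The function name, the sample lines, and the host-boundary check are illustrative assumptions, not part of the thread.

```python
import csv
import re

# No lookahead needed: the regex only ever sees one field.
# The trailing group makes sure the host ends right after "super.de".
SUPER_URL = re.compile(r'https?://(?:www\.)?super\.de(?:[/?#:]|$)')

def super_de_urls(lines):
    """Yield the third-from-last field of each line when its host is super.de."""
    # skipinitialspace tolerates the occasional ', "' separator in the log.
    for fields in csv.reader(lines, skipinitialspace=True):
        if len(fields) < 3:
            continue  # malformed or empty line
        url = fields[-3]
        if SUPER_URL.match(url):
            yield url

# Hypothetical, shortened sample lines.
sample = [
    '"[14/Oct/2014:13:02:15 +0200]","70","/API/px2.js","?ts=0.3", "200","+", '
    '"http://www.super.de/news/x.html","Mozilla/5.0","-"',
    '"[14/Oct/2014:13:02:16 +0200]","70","/API/site,super.de/px2.js","?ts=0.3", "200","+", '
    '"http://www.bla.de/Arzt/","Mozilla/4.0","-"',
]

urls = list(super_de_urls(sample))
print(urls)
```

Because `csv.reader` accepts any iterable of lines, an open file object can be passed in directly, which keeps memory use flat on multi-gigabyte logs.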

Cloudscape Germany Over a year ago

OK, if we add two parentheses, one at the beginning and one at the end, your example will work: regex101.com/r/cU3sY9/4

anubhava Over a year ago

Another way to simplify that regex is: ^(?:[^"]*"[^"]*"[^"]*){14}"[^"]*"(https?:\/\/(?:www\.)?super\.de[^"]+) See this demo: regex101.com/r/cU3sY9/5
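The pattern above counts fields from the start of the line. A different way to pin the match to the third-from-last field, not taken from the thread, is to anchor at the end of the line instead. The sketch below assumes every field is double-quoted with no embedded quotes; the pattern, the helper name, and the sample lines are all illustrative.

```python
import re

TAIL_URL = re.compile(
    r'"(https?://(?:www\.)?super\.de[^"]*)"'  # candidate field, captured
    r'\s*,\s*"[^"]*"'                          # second-to-last field
    r'\s*,\s*"[^"]*"\s*$'                      # last field, then end of line
)

def tail_url(line):
    """Return the super.de URL only if it is the third-from-last field."""
    m = TAIL_URL.search(line)
    return m.group(1) if m else None

# Hypothetical, shortened sample lines.
hit = ('"[14/Oct/2014:13:02:15 +0200]","70","/px2.js", "200","+", '
       '"http://super.de/news/a.html","Mozilla/5.0","-"')
miss = ('"[14/Oct/2014:13:02:15 +0200]","70","http://www.super.de/early", "200","+", '
        '"http://www.bla.de/Arzt/","Mozilla/4.0","-"')

print(tail_url(hit))
print(tail_url(miss))
```

This avoids both the lookahead and a hard-coded field count, at the cost of relying on the URL always being exactly two fields before the end.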


vks · Accepted Answer · 2014-10-27 10:13:09Z

1

((?:www\.)?super\.de[^"]*)

You can try this to grab the URLs with super.de as the domain. See the demo. Use re.findall or re.search.

http://regex101.com/r/sU3fA2/9
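A minimal sketch of how this pattern behaves with re.findall, using two made-up, shortened log lines. The first has super.de in the URL field; the second only has it inside an earlier path field, which is the situation raised in the comments on this answer.

```python
import re

# Pattern from the answer, unchanged.
PATTERN = re.compile(r'((?:www\.)?super\.de[^"]*)')

# Hypothetical, shortened sample lines.
referrer_hit = ('"/API/px2.js","?ts=0.3", "200","+", '
                '"http://www.super.de/news/x.html","Mozilla/5.0","-"')
early_hit = ('"/API/site,super.de/px2.js","?ts=0.3", "200","+", '
             '"http://www.bla.de/Arzt/","Mozilla/4.0","-"')

found_referrer = PATTERN.findall(referrer_hit)
found_early = PATTERN.findall(early_hit)

print(found_referrer)  # the capture starts at "www.", so the scheme is not included
print(found_early)     # matches inside the path field, not the URL field
```

Since the pattern is not tied to a field position, it matches super.de wherever it appears in the line.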


2 Comments

Cloudscape Germany Over a year ago

Almost perfect, but this would also capture the wrong group when super occurs somewhere other than the (n-2)th group, e.g. in the 5th group (it happens).

Cloudscape Germany Over a year ago

See regex101.com/r/cU3sY9/2. I only need this to match the (n-2)th group, where n is the last group.
