regex with variable part

Question

How can I merge these two regexes into a single regex that captures all available parts, depending on the structure of the string? The last 3 fields in the second $s are optional and should be captured if they exist. Using (?= ... ) I could not get a working solution.

$s='1.2.3.4 - egon  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488';
$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    \Z/x;
print "[".join('],[',$s =~ $re)."]\n\n";   

$s='1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"';
$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+) [ ] "(.*?)" [ ] "(.*?)" [ ] "(.*?)"
        \Z
        /x;
print "[".join('],[',$s =~ $re)."]\n\n";

Update: my $tokenize = qr/\A (\d+)\.(\d+)\.(\d+)\.(\d+) [ ] (\S+) (?: [ ] (\S*))? (?: [ ] (\S*))? [ ] \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\] [ ] "(?:(\S+) [ ])? (.*?) (?:[ ] (\S+))?" [ ] (\S+) [ ] (\S+) (?: [ ] "(.*?)" [ ] "(.*?)" [ ] "(.*?)" )? \Z/x;
– bootware, Commented Mar 27, 2013 at 3:39

TLP · Accepted Answer · 2013-03-27 01:46:59Z

4

When your regexes start looking like that, I think it's a good idea to start thinking about alternatives. In this case, you might try Text::ParseWords, since your strings are more or less delimited and contain quoted strings. It is a core module in Perl 5.

Basically, we pass quotewords a regex for the delimiters we expect, a 0 or 1 flag for whether to keep the quotes, and the input lines themselves.

use strict;
use warnings;
use Text::ParseWords;

my $s = '1.2.3.4 - egon  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488';
my @s = quotewords('[\s/:\[\].]+', 0, $s);
print "[".join('],[',@s)."]\n\n";   

$s = '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"';
@s = quotewords('[\s/:\[\].]+', 0, $s);
print "[".join('],[',@s)."]\n\n";

Output:

[1],[2],[3],[4],[-],[egon],[10],[Dec],[2007],[21],[07],[20],[+0100],[GET /x.htm HTTP/1.1],[401],[488]

[1],[2],[3],[4],[-],[-],[13],[Jun],[2007],[01],[37],[44],[+0200],[GET /x.htm HTTP/1.0],[404],[283],[-],[Mozilla/5.0...],[-]
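Since quotewords returns the whole request as a single field, it can be split into method, path and protocol in a second step if needed. The following is only a minimal sketch: the index 13 is an assumption that holds for this particular line layout (no password field), and the variable names $method, $path and $protocol are made up for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords;

my $line = '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /this is a test HTTP/1.0" 404 283';
my @fields = quotewords('[\s/:\[\].]+', 0, $line);

# Assumption: for this layout the quoted request is field 13 (0-based).
my $request = $fields[13];

# First word is the method, last word is the protocol,
# everything in between (spaces included) is the path.
my ($method, $path, $protocol) = $request =~ /\A (\S+) [ ] (.*) [ ] (\S+) \z/x;

print "[$method],[$path],[$protocol]\n";
```

A request such as "-" does not match this pattern, so all three variables would be undef and should be checked before use.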


6 Comments

bootware Over a year ago

Nice idea, but unfortunately one cannot keep the grouping of the words. The request or user agent, for example, may contain more than one word and should be returned as a single scalar. @TLP

TLP Over a year ago

@bootware I have no idea what you are talking about. With this solution the request or user agent can contain any number of words at all, it doesn't matter. Your regex, on the other hand, can only handle exactly 3 words, and breaks for all other cases.

bootware Over a year ago

The regex handles all words of the request and keeps them together. Look here: $s='1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /this is a test HTTP/1.0" 404 283'; print "[".join('],[',$s =~ $re)."]\n\n"; [1],[2],[3],[4],[-],[-],[13],[Jun],[2007],[01],[37],[44],[+0200],[GET],[/this is a test],[HTTP/1.0],[404],[283],[],[],[] @TLP

TLP Over a year ago

@bootware No, the regex breaks off GET and HTTP/1.0. My solution keeps the string intact: GET /this is a test HTTP/1.0. The benefit of my solution is that it is easier to maintain. If you want to post-process the request string and break off GET and HTTP/1.0, you can easily do that.

TLP Over a year ago

@bootware Well, no. If there is no password in authentication then your regex also fails. So... no. You don't have to use this solution, but don't try and make up things that are not true about it, please. That is just annoying. Find me a test case where your regex works and my solution fails if you want to continue this discussion.


bonsaiviking · Accepted Answer · 2013-03-27 01:39:44Z

2

Instead of using a lookahead (?=), you can use a non-capturing group (?:) and match zero or one occurrence:

$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

This will yield a fixed-length list of captures, but the last 3 will be undef if the optional group does not match. If you have to match between 1 and 3 optional fields, wrap each in its own non-capturing group with zero or one (?) occurrences. I also tried this, but it doesn't work:

(?: [ ] "(.*?)" ){0,3} \Z

It matches, and captures each of the last three fields, but each capture overwrites the final position in the capture array, so after the capture is done, it contains just the final field.
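To make the difference concrete, here is a small sketch that applies both variants to just the optional tail of a log line. The variable names ($tail, @separate, @repeated) are made up for this illustration.

```perl
use strict;
use warnings;

my $tail = ' "-" "Mozilla/5.0..." "-"';

# Each optional field in its own non-capturing group: three capture slots.
my @separate = $tail =~ /\A (?: [ ] "(.*?)" )? (?: [ ] "(.*?)" )? (?: [ ] "(.*?)" )? \z/x;

# One capturing group repeated up to three times: a single slot,
# which keeps only the value from the last repetition.
my @repeated = $tail =~ /\A (?: [ ] "(.*?)" ){0,3} \z/x;

print scalar(@separate), " captures: [", join('],[', @separate), "]\n";
print scalar(@repeated), " capture: [", join('],[', @repeated), "]\n";
```

The first pattern returns all three fields, while the second returns only the last one, because the number of capture slots is fixed by the number of parentheses in the pattern, not by how often a group repeats.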

I would caution you that you are using a very strict expression that may not be suited to all web logs: specifically, the match for IP address will not handle IPv6 addresses, and the match for User-agent may not handle user agents with " characters, depending on how they are escaped (lighttpd 1.4.28 does not escape them, for instance).


1 Comment

bootware Over a year ago

Thank you for the hint. You never stop learning. Your warnings are useful too. I'll test it. @bonsaiviking

bootware · Accepted Answer · 2013-03-27 18:01:58Z

I did not mean to dismiss any of the suggested solutions.

As I said before: nice idea. But it only does what the package name suggests: ParseWords.

"Find me a test case where your regex works and my solution fails if you want to continue this discussion ...".

Of course I have tested your solution for my purposes.

In your solution the fields are shifted around, depending on the input.

With the regex, I always find the fields at defined positions.

(for example: authuser at $token[5] and year at $token[9])

Here is the test:

#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords;

my $re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    (?: [ ] (\S*))? (?: [ ] (\S*))?
    [ ] \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(?:(\S+) [ ])? (.*?) (?:[ ] (\S+))?"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

my (@s,@token);
#---- most entries ------------------------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283');
#---- referer, user agent, ... ------------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"');
#---- auth without password ---------------------------------------------------
push(@s,'1.2.3.4 - ausr  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488');
#---- no http request --------------------------------------------------------- 
push(@s,'1.2.3.4 - - [13/Jun/2007:19:16:18 +0200] "-" 408 -');
#---- auth with password ------------------------------------------------------
push(@s,'1.2.3.4 - ausr pwd [12/Jul/2006:16:55:04 +0200] "GET /x.htm HTTP/1.1" 401 489');
#---- auth without user -------------------------------------------------------
push(@s,'1.2.3.4 -  pwd [16/Aug/2007:08:43:50 +0200] "GET /x.htm HTTP/1.1" 401 489');
#---- multiple words in request -----------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /this is test HTTP/1.0" 404 283'); 

no warnings 'uninitialized';
foreach(@s)
{ @token=$_ =~ $re;
  print "regex:      AUTHUSER=".$token[5].", YEAR=".$token[9]."\n";
  @token=quotewords('[\s/:\[\].]+', 0, $_);
  print "quotewords: AUTHUSER=".$token[5].", YEAR=".$token[9]."\n\n";
}

And here are the results:

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01

regex:      AUTHUSER=ausr, YEAR=2007
quotewords: AUTHUSER=ausr, YEAR=21

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=19

regex:      AUTHUSER=ausr, YEAR=2006
quotewords: AUTHUSER=ausr, YEAR=2006

regex:      AUTHUSER=, YEAR=2007
quotewords: AUTHUSER=pwd, YEAR=08

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01
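The shifting comes from the number of fields each approach returns. As a rough sketch (the labels and the %count hash are only there for this illustration), one can count the fields for a line with and without a password:

```perl
use strict;
use warnings;
use Text::ParseWords;

# Same pattern as $re in the test script.
my $re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    (?: [ ] (\S*))? (?: [ ] (\S*))?
    [ ] \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(?:(\S+) [ ])? (.*?) (?:[ ] (\S+))?"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

my @samples = (
    [ 'no password'   => '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283' ],
    [ 'with password' => '1.2.3.4 - ausr pwd [12/Jul/2006:16:55:04 +0200] "GET /x.htm HTTP/1.1" 401 489' ],
);

my %count;   # illustrative helper: label => [regex fields, quotewords fields]
for my $sample (@samples) {
    my ($label, $line) = @$sample;
    my @by_regex = $line =~ $re;                           # one slot per capture group
    my @by_words = quotewords('[\s/:\[\].]+', 0, $line);   # one entry per token found
    $count{$label} = [ scalar @by_regex, scalar @by_words ];
    printf "%-13s regex: %d fields, quotewords: %d fields\n",
        $label, scalar @by_regex, scalar @by_words;
}
```

The regex returns one slot per capture group whether or not the group took part in the match, so the positions stay fixed. quotewords returns only the tokens that are actually present, so every optional field that appears moves the later fields by one.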
