
How can I merge these two regexes into a single regex that captures all available parts, depending on the string's structure? The last three fields in $s are optional and should be captured if they exist. I could not get a working solution using (?= ... ).

$s='1.2.3.4 - egon  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488';
$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    \Z/x;
print "[".join('],[',$s =~ $re)."]\n\n";   

$s='1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"';
$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+) [ ] "(.*?)" [ ] "(.*?)" [ ] "(.*?)"
        \Z
        /x;
print "[".join('],[',$s =~ $re)."]\n\n";   
  • update: my $tokenize = qr/\A (\d+)\.(\d+)\.(\d+)\.(\d+) [ ] (\S+) (?: [ ] (\S*))? (?: [ ] (\S*))? [ ] \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\] [ ] "(?:(\S+) [ ])? (.*?) (?:[ ] (\S+))?" [ ] (\S+) [ ] (\S+) (?: [ ] "(.*?)" [ ] "(.*?)" [ ] "(.*?)" )? \Z/x; Commented Mar 27, 2013 at 3:39

3 Answers

4

When your regexes start looking like that, I think it's a good idea to start thinking about alternatives. In this case, you might try Text::ParseWords, since your strings are sort of delimited and contain quoted strings. It is a core module in Perl 5.

Basically what we're doing is supplying a regex for the delimiters that we expect, a 0 or 1 for keeping the quotes, and the input lines themselves.

use strict;
use warnings;
use Text::ParseWords;

my $s = '1.2.3.4 - egon  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488';
my @s = quotewords('[\s/:\[\].]+', 0, $s);
print "[".join('],[',@s)."]\n\n";   

$s = '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"';
@s = quotewords('[\s/:\[\].]+', 0, $s);
print "[".join('],[',@s)."]\n\n";   

Output:

[1],[2],[3],[4],[-],[egon],[10],[Dec],[2007],[21],[07],[20],[+0100],[GET /x.htm HTTP/1.1],[401],[488]

[1],[2],[3],[4],[-],[-],[13],[Jun],[2007],[01],[37],[44],[+0200],[GET /x.htm HTTP/1.0],[404],[283],[-],[Mozilla/5.0...],[-]
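
If the method and protocol are needed as separate values, the request field that quotewords returns can be split in a second step. The following is only a sketch: it assumes a log line without a password field, so that the request lands at index 13 with this delimiter set, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords;

my $s = '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /this is a test HTTP/1.0" 404 283';
my @f = quotewords('[\s/:\[\].]+', 0, $s);

# Assumption: no password field, so the quoted request is element 13.
my $request = $f[13];

# First word is the method, last word is the protocol,
# everything in between (spaces included) is the path.
my ($method, $path, $proto) = $request =~ /\A (\S+) [ ] (.*) [ ] (\S+) \z/x;

print "[" . join('],[', $method, $path, $proto) . "]\n";
```

Because the index depends on which optional fields are present, a real parser would need to locate the request field more carefully than a fixed position.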

6 Comments

Nice idea, but unfortunately it cannot keep the words grouped. The request or user agent, for example, may contain more than one word and should be returned as a single scalar. @TLP
@bootware I have no idea what you are talking about. With this solution the request or user agent can contain any number of words at all, it doesn't matter. Your regex, on the other hand, can only handle exactly 3 words, and breaks for all other cases.
The regex handles all words of the request and keeps them together. Look here: $s='1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /this is a test HTTP/1.0" 404 283'; print "[".join('],[',$s =~ $re)."]\n\n"; [1],[2],[3],[4],[-],[-],[13],[Jun],[2007],[01],[37],[44],[+0200],[GET],[/this is a test],[HTTP/1.0],[404],[283],[],[],[] @TLP
@bootware No, the regex breaks off GET and HTTP/1.0. My solution keeps the string intact: GET /this is a test HTTP/1.0. The benefit of my solution is that it is easier to maintain. If you want to post-process the request string and break off GET and HTTP/1.0, you can easily do that.
@bootware Well, no. If there is no password in authentication then your regex also fails. So... no. You don't have to use this solution, but don't try and make up things that are not true about it, please. That is just annoying. Find me a test case where your regex works and my solution fails if you want to continue this discussion.
2

Instead of using a lookahead (?=), you can use a non-capturing group (?:) and match zero or one occurrence:

$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

This will yield a fixed-length array of captures, but the last 3 will be undef if the optional group does not match. If you have to match between 1 and 3 optional fields, wrap each in its own non-capturing group with a zero-or-one (?) quantifier. I also tried this, but it doesn't work:

(?: [ ] "(.*?)" ){0,3} \Z

It matches, and captures each of the last three fields, but each capture overwrites the final position in the capture array, so after the capture is done, it contains just the final field.
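
A small sketch of the difference, using a made-up tail string with placeholder values "a", "b" and "c" instead of real referer and user-agent fields. The first pattern repeats a single capture group, the second gives each optional field its own group:

```perl
use strict;
use warnings;

# Placeholder tail of a log line (values are illustrative only).
my $tail = ' "a" "b" "c"';

# One capture group inside a quantified group: only the last iteration is kept.
my @repeated = $tail =~ /\A (?: [ ] "(.*?)" ){0,3} \z/x;

# One optional non-capturing group per field: every field keeps its own slot.
my $separate_re = qr/\A
    (?: [ ] "(.*?)" )?
    (?: [ ] "(.*?)" )?
    (?: [ ] "(.*?)" )?
    \z/x;
my @separate = $tail =~ $separate_re;

# With fewer fields, the unused slots come back as undef.
my @partial = ' "a"' =~ $separate_re;

for my $result (\@repeated, \@separate, \@partial) {
    print "[" . join('],[', map { defined $_ ? $_ : 'undef' } @$result) . "]\n";
}
```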

I would caution you that you are using a very strict expression that may not be suited to all web logs: specifically, the match for IP address will not handle IPv6 addresses, and the match for User-agent may not handle user agents with " characters, depending on how they are escaped (lighttpd 1.4.28 does not escape them, for instance).
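
One possible way to loosen the address match, shown only as a sketch: accept any non-space token as the client address and split it into octets afterwards when it happens to be IPv4. The helper name client_address and the sample IPv6 address are assumptions for illustration, not part of the original regex.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the client address and, for IPv4 only, its octets.
sub client_address {
    my ($line) = @_;

    # Take the first space-delimited token, whatever its format.
    my ($host) = $line =~ /\A (\S+) [ ]/x or return;

    # In list context a failed match yields an empty list, so IPv6 gets no octets.
    my @octets = $host =~ /\A (\d+)\.(\d+)\.(\d+)\.(\d+) \z/x;

    return ($host, @octets);
}

for my $line ('1.2.3.4 - - [13/Jun/2007:01:37:44 +0200]',
              '2001:db8::1 - - [13/Jun/2007:01:37:44 +0200]') {
    print "[" . join('],[', client_address($line)) . "]\n";
}
```

The trade-off is that the capture positions of the original regex change, since the address becomes one field instead of four.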

1 Comment

Thank you for the hint. You never stop learning. Your warnings are useful too. I'll test it. @bonsaiviking
0

I did not mean to talk down any of the suggested solutions.

As I said before: nice idea. But it only does what the module name implies: ParseWords.

"Find me a test case where your regex works and my solution fails if you want to continue this discussion ...".

Of course I have tested your solution for my purposes.

In your solution the fields shift position depending on the input.

With the regex, I always find the fields at defined positions

(for example: Authuser at $token[5] and Year at $token[9]).

Here is the test:

#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords;

my $re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    (?: [ ] (\S*))? (?: [ ] (\S*))?
    [ ] \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(?:(\S+) [ ])? (.*?) (?:[ ] (\S+))?"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

my (@s,@token);
#---- most entries ------------------------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283');
#---- referer, user agent, ... ------------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"');
#---- auth without password ---------------------------------------------------
push(@s,'1.2.3.4 - ausr  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488');
#---- no http request --------------------------------------------------------- 
push(@s,'1.2.3.4 - - [13/Jun/2007:19:16:18 +0200] "-" 408 -');
#---- auth with password ------------------------------------------------------
push(@s,'1.2.3.4 - ausr pwd [12/Jul/2006:16:55:04 +0200] "GET /x.htm HTTP/1.1" 401 489');
#---- auth without user -------------------------------------------------------
push(@s,'1.2.3.4 -  pwd [16/Aug/2007:08:43:50 +0200] "GET /x.htm HTTP/1.1" 401 489');
#---- multiple words in request -----------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /this is test HTTP/1.0" 404 283'); 

no warnings 'uninitialized';
foreach my $line (@s) {
    @token = $line =~ $re;
    print "regex:      AUTHUSER=".$token[5].", YEAR=".$token[9]."\n";
    @token = quotewords('[\s/:\[\].]+', 0, $line);
    print "quotewords: AUTHUSER=".$token[5].", YEAR=".$token[9]."\n\n";
}

And here are the results:

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01

regex:      AUTHUSER=ausr, YEAR=2007
quotewords: AUTHUSER=ausr, YEAR=21

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=19

regex:      AUTHUSER=ausr, YEAR=2006
quotewords: AUTHUSER=ausr, YEAR=2006

regex:      AUTHUSER=, YEAR=2007
quotewords: AUTHUSER=pwd, YEAR=08

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01
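
If fixed positions are the goal, named captures (available from Perl 5.10) might be worth a look, because fields are then looked up by name rather than by index. This is a sketch under assumptions, not a drop-in replacement: the capture names and the helper parse_log_line are invented here, the four address octets are captured as one field, and the request is kept as a single field.

```perl
use 5.010;    # named captures and %+ need Perl 5.10 or later
use strict;
use warnings;

my $named = qr/\A
    (?<ip> \d+\.\d+\.\d+\.\d+ )
    [ ] (?<ident> \S+ )
    (?: [ ] (?<user> \S* ) )? (?: [ ] (?<pwd> \S* ) )?
    [ ] \[ (?<day>\d+) \/ (?<month>\S+) \/ (?<year>\d+) : (?<time>\d+:\d+:\d+) [ ] (?<tz>\S+) \]
    [ ] " (?<request> .*? ) "
    [ ] (?<status> \S+ )
    [ ] (?<bytes> \S+ )
    (?: [ ] "(?<referer>.*?)" [ ] "(?<agent>.*?)" [ ] "(?<extra>.*?)" )?
    \z/x;

# Hypothetical helper: returns a hash ref of the named fields, or undef on no match.
sub parse_log_line {
    my ($line) = @_;
    return unless $line =~ $named;
    my %fields = %+;    # copy: %+ only holds the groups that took part in the match
    return \%fields;
}

for my $line (
    '1.2.3.4 - ausr pwd [12/Jul/2006:16:55:04 +0200] "GET /x.htm HTTP/1.1" 401 489',
    '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"',
) {
    my $f = parse_log_line($line) or next;
    printf "AUTHUSER=%s, YEAR=%s\n", $f->{user} // '', $f->{year};
}
```

Optional fields that did not match are simply absent from the hash, so a check such as exists $f->{agent} can tell the two line formats apart.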

1 Comment

I have posted the tests. Regards @TLP
