I have a custom access log format for Apache:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{JSESSIONID}C %D %V" mylog

I am trying to parse the generated logs from Python, but I have two problems:

  • Requests without a protocol version (HTTP/1.0 or HTTP/1.1) are not parsed correctly.
  • Requests with spaces in the requested path are not parsed correctly (I don't know whether Apache normally stores the path encoded or keeps the spaces, but I was able to generate such a log line by making a request by hand with telnet).

Using this regex:

(?P<ip>.*) (?P<remote_log_name>.*) (?P<userid>.*) \[(?P<date>.*)(?= ) (?P<timezone>.*?)\] \"(?P<request_method>.*) (?P<path>.*)(?P<request_version> HTTP/.*)\" (?P<status>.*) (?P<length>.*) \"(?P<referrer>.*)\" \"(?P<user_agent>.*)\" (?P<session_id>.*) (?P<generation_time_micro>.*) (?P<virtual_host>.*)

The parsing fails with the first 3 lines of this log:

1.1.1.2 - - [11/Nov/2016:03:04:55 +0100] "GET /" 200 83 "-" "-" - 9221 1.1.1.1
127.0.0.1 - - [11/Nov/2016:14:24:21 +0100] "GET /uno dos" 404 298 "-" "-" - 400233 1.1.1.1
127.0.0.1 - - [11/Nov/2016:14:23:37 +0100] "GET /uno dos HTTP/1.0" 404 298 "-" "-" - 385111 1.1.1.1
1.1.1.1 - - [11/Nov/2016:00:00:11 +0100] "GET /icc HTTP/1.1" 302 - "-" "XXX XXX XXX" - 6160 11.1.1.1
1.1.1.1 - - [11/Nov/2016:00:00:11 +0100] "GET /icc/ HTTP/1.1" 302 - "-" "XXX XXX XXX" - 2981 1.1.1.1

The regex can be tested here: https://regex101.com/r/xDfSqj/2.
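To make the failure concrete, here is a minimal Python sketch that runs the regex above against the first three sample lines. The names `ORIGINAL`, `SAMPLE_LINES` and `results` are only illustrative, and `re.match` is an assumption about how the pattern would be applied.

```python
import re

# The regex from the question, split over several raw strings for readability.
ORIGINAL = re.compile(
    r'(?P<ip>.*) (?P<remote_log_name>.*) (?P<userid>.*) '
    r'\[(?P<date>.*)(?= ) (?P<timezone>.*?)\] '
    r'\"(?P<request_method>.*) (?P<path>.*)(?P<request_version> HTTP/.*)\" '
    r'(?P<status>.*) (?P<length>.*) '
    r'\"(?P<referrer>.*)\" \"(?P<user_agent>.*)\" '
    r'(?P<session_id>.*) (?P<generation_time_micro>.*) (?P<virtual_host>.*)'
)

# The first three lines of the sample log.
SAMPLE_LINES = [
    '1.1.1.2 - - [11/Nov/2016:03:04:55 +0100] "GET /" 200 83 "-" "-" - 9221 1.1.1.1',
    '127.0.0.1 - - [11/Nov/2016:14:24:21 +0100] "GET /uno dos" 404 298 "-" "-" - 400233 1.1.1.1',
    '127.0.0.1 - - [11/Nov/2016:14:23:37 +0100] "GET /uno dos HTTP/1.0" 404 298 "-" "-" - 385111 1.1.1.1',
]

results = []
for line in SAMPLE_LINES:
    match = ORIGINAL.match(line)
    # Keep the named groups, or None when the line does not match at all.
    results.append(match.groupdict() if match else None)

for line, parsed in zip(SAMPLE_LINES, results):
    if parsed is None:
        print('NO MATCH:', line)
    else:
        print(repr(parsed['request_method']), repr(parsed['path']))
```

The first two lines do not match at all, because the pattern requires " HTTP/" inside the request. The third line matches, but the greedy request_method group takes "GET /uno" and leaves only "dos" as the path.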

  • Could you help me understand why a lookahead (?= ) is used after the date match? Removing it doesn't seem to change the result in the regex simulator, and according to the Apache docs, that space is always present. Commented Apr 14, 2017 at 12:17
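For what it's worth, the lookahead looks redundant here: (?= ) only asserts that a space follows, and the literal space right after it already requires the same thing. A small sketch in Python, using just the date/timezone fragment of the regex and one of the sample lines (the variable names are only illustrative):

```python
import re

line = ('1.1.1.1 - - [11/Nov/2016:00:00:11 +0100] "GET /icc HTTP/1.1" '
        '302 - "-" "XXX XXX XXX" - 6160 11.1.1.1')

# Date/timezone fragment as written in the question, with the lookahead.
with_lookahead = re.compile(r'\[(?P<date>.*)(?= ) (?P<timezone>.*?)\]')
# Same fragment with the lookahead removed.
without_lookahead = re.compile(r'\[(?P<date>.*) (?P<timezone>.*?)\]')

a = with_lookahead.search(line).groupdict()
b = without_lookahead.search(line).groupdict()

print(a)
print(a == b)
```

Both patterns give the same date and timezone for this line, which is consistent with what the regex simulator shows.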

2 Answers

Try this solution: https://regex101.com/r/xDfSqj/4

It's the same regex you had, with two changes:

(?P<ip>.*?) (?P<remote_log_name>.*?) (?P<userid>.*?) \[(?P<date>.*?)(?= ) (?P<timezone>.*?)\] \"(?P<request_method>.*?) (?P<path>.*?)(?P<request_version> HTTP/.*)?\" (?P<status>.*?) (?P<length>.*?) \"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\" (?P<session_id>.*?) (?P<generation_time_micro>.*?) (?P<virtual_host>.*)

The existing request_version group (the one around HTTP/1.0) has been made optional with the ? quantifier. A ? has also been added after the .* in the other groups to make them lazy and prevent greedy capturing.

Is this what you were trying to achieve?
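Since the question is about Python, here is a minimal sketch of how this regex could be used there. The names `LOG_PATTERN`, `parse_line` and `SAMPLE_LINES` are only illustrative, and using `re.match` once per line is an assumption about how the log is read.

```python
import re

# The regex from this answer, split over several raw strings for readability.
LOG_PATTERN = re.compile(
    r'(?P<ip>.*?) (?P<remote_log_name>.*?) (?P<userid>.*?) '
    r'\[(?P<date>.*?)(?= ) (?P<timezone>.*?)\] '
    r'\"(?P<request_method>.*?) (?P<path>.*?)(?P<request_version> HTTP/.*)?\" '
    r'(?P<status>.*?) (?P<length>.*?) '
    r'\"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\" '
    r'(?P<session_id>.*?) (?P<generation_time_micro>.*?) (?P<virtual_host>.*)'
)


def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The five sample lines from the question.
SAMPLE_LINES = [
    '1.1.1.2 - - [11/Nov/2016:03:04:55 +0100] "GET /" 200 83 "-" "-" - 9221 1.1.1.1',
    '127.0.0.1 - - [11/Nov/2016:14:24:21 +0100] "GET /uno dos" 404 298 "-" "-" - 400233 1.1.1.1',
    '127.0.0.1 - - [11/Nov/2016:14:23:37 +0100] "GET /uno dos HTTP/1.0" 404 298 "-" "-" - 385111 1.1.1.1',
    '1.1.1.1 - - [11/Nov/2016:00:00:11 +0100] "GET /icc HTTP/1.1" 302 - "-" "XXX XXX XXX" - 6160 11.1.1.1',
    '1.1.1.1 - - [11/Nov/2016:00:00:11 +0100] "GET /icc/ HTTP/1.1" 302 - "-" "XXX XXX XXX" - 2981 1.1.1.1',
]

parsed = [parse_line(line) for line in SAMPLE_LINES]

for entry in parsed:
    print(entry['request_method'], repr(entry['path']), repr(entry['request_version']))
```

Two things to keep in mind: request_version is None when the request has no protocol version, and when it is present it keeps its leading space (for example " HTTP/1.0"), so you may want to call .strip() on it.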

2 Comments

Great! Note: it was necessary for me to replace HTTP/ with HTTP\/.
This regex DOES work with malicious payloads that include escaped double quotes in the referrer content, so thank you!

The regex that I used is below (as a Java string, so the backslashes are doubled):

^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"(.+?)\"

You can refer to this link to understand how Apache log file parsing works; it also includes code for parsing Apache log files in Java. I hope you find it useful and that it solves your problem.

1 Comment

Unfortunately the regex does not work for Group 5 when there is an escaped DQUOTE character. I get those often with malicious payloads that try to do SQL injection attacks. If anyone can update the regex to work with them I would love to see it.
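One possible way to handle this, as a rough sketch rather than a tested drop-in: replace each quoted group with a pattern that accepts either a character that is neither a quote nor a backslash, or a backslash followed by any character. This assumes Apache writes an embedded double quote as \" in the log. The sketch is in Python (single backslashes in raw strings), the names `QUOTED`, `COMBINED` and `sample` are only illustrative, the sample line is made up for the example, and the length group is also widened to accept "-".

```python
import re

# A quoted field that may contain backslash-escaped characters such as \".
QUOTED = r'"((?:[^"\\]|\\.)*)"'

# The regex from this answer, with each quoted group swapped for QUOTED
# and the length group widened to accept "-" (assumption, not in the original).
COMBINED = re.compile(
    r'^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] '
    + QUOTED + r' (\d{3}) (\d+|-) ' + QUOTED + ' ' + QUOTED
)

# Hypothetical log line with an escaped double quote inside the request.
sample = (
    '127.0.0.1 - - [11/Nov/2016:14:23:37 +0100] '
    '"GET /?q=\\" OR 1=1 HTTP/1.0" 404 298 "-" "-"'
)

m = COMBINED.match(sample)
print(m.group(5))
print(m.group(6), m.group(7))
```

Group 5 keeps the escaped quote as it appears in the log, backslash included, so it would still need unescaping afterwards if you want the original request string.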
