2

I'm using the Python 3 CSV Reader to read some web-log files into a namedtuple. I have no control over the log file structures, and there are varying types.

The delimiter is a space ( ), the problem is that some log file formats place a space in the timestamp, as Logfile 2 below. The CSV reader then reads the date/time stamp as two fields.

Logfile 1

73 58 2993 [22/Jul/2016:06:51:06.299] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"

Logfile 2

13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
153 38 224 [22/Jul/2016:06:51:07 +0000] 2[2] "GET /test HTTP/1.1"

The log files typically have the timestamp within square quotes, but I cannot find a way of handling them as "quotes". On top of that, square brackets are not always used as quotes within the logs either (see the [2] later in the logs).

I've read through the Python 3 CSV Reader documentation, including about dialects, but there doesn't seem to be anything for handling enclosing square brackets.

How can I handle this situation automatically?

2 Answers 2

1

This will do, you need to use a regex in place of sep.
This for example will parse NGinx log files into a pandas.Dataframe:

import pandas as pd

df = pd.read_csv(log_file,
              sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
              engine='python',
              usecols=[0, 3, 4, 5, 6, 7, 8],
              names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
              na_values='-',
              header=None
                )

Edit :

line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = '([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'

import re
print re.match(regex, line).groups()

The output would be a tuple with 6 pieces of information

('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
Sign up to request clarification or add additional context in comments.

2 Comments

I'm guessing this requires the Pandas library. It's good to know it's possible this way, but I was hoping to do it with just the standard libraries.
Added a version with just regex instead of having to add pandas
0

Bruteforce! I concatenate the 2 time fields so the timestamp can be interpreted if needed.

   data.csv = 
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1" 
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"


import csv

with open("data.csv") as f:
    c = csv.reader(f, delimiter=" ")
    for row in c:
        if len(row) == 7:
            r = row[:3] + [row[3] + row[4]] + row[5:]
            print(r)
        else:
            print(row)

['13', '58', '224', '[22/Jul/2016:06:51:06.399]', '2[2]', 'GET /example HTTP/1.1']
['13', '58', '224', '[22/Jul/2016:06:51:06+0000]', '2[2]', 'GET /test HTTP/1.1']

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.