
I have a log file with 1,001,623 lines formatted like this:

[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352

Each entry is on its own line.

I loop over the file and use a regular expression to extract the information I need (date, id, product):

for txt in logfile:
    m = rg.search(txt)
    if m:
        l1=m.group(1)
        l2=m.group(2)
        l3=m.group(3)
        dt=dt.append(pd.Series([l1]))
        art=art.append(pd.Series([l2]))
        usr=usr.append(pd.Series([l3]))

This works fine in testing with a small sample, but on the entire file it has been running for 12 hours without showing any progress. Afterwards I want to build a DataFrame from the results to do some analytics. Is there a better way to do this?

Edit:

This is how I open the log file.

logfile = open("data/access.log", "r")

The regex

re1='.*?'   # Non-greedy match on filler
re2='((?:(?:[0-2]?\\d{1})|(?:[3][01]{1}))[-:\\/.](?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Sept|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[-:\\/.](?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3})))(?![\\d])'  # DDMMMYYYY 1
re3='.*?'   # Non-greedy match on filler
re4='\\d+'  # Uninteresting: int
re5='.*?'   # Non-greedy match on filler
re6='\\d+'  # Uninteresting: int
re7='.*?'   # Non-greedy match on filler
re8='\\d+'  # Uninteresting: int
re9='.*?'   # Non-greedy match on filler
re10='(\\d+)'   # Integer Number 1
re11='.*?'  # Non-greedy match on filler
re12='(\\d+)'   # Integer Number 2

rg =  re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
  • Can you please add how you open the file (variable txt)? Commented Jan 10, 2016 at 15:29
  • Can you add a bit more code? How did you assign logfile and what is your regex? Commented Jan 10, 2016 at 15:29
  • Added the information, thank you! Commented Jan 10, 2016 at 15:33
  • It may be taking a long time because I don't think you are reading the file line by line. You should be doing for txt in logfile.readlines(): Commented Jan 10, 2016 at 16:01

1 Answer

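You can use pandas. First read the file with read_csv, then strip the [] from the date column with str.strip and convert it with to_datetime.

Then parse id and prod by splitting the data column, and finally combine everything with concat: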

import pandas as pd
import io

temp=u"""[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352"""

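#change io.StringIO(temp) to the path of the log file, e.g. "data/access.log"
#sep=r"\s+" splits on runs of whitespace; "\s*" can also match the empty string
df = pd.read_csv(io.StringIO(temp), sep=r"\s+", engine='python', header=None,
                 names=['date','get','data','http','no1','no2'])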

#format - http://strftime.org/
df['date'] = pd.to_datetime(df['date'].str.strip('[]'), format="%d/%b/%Y:%H:%M:%S")

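#split the data column on '=' into '/click?id', '162&prod' and '5475'
df1 = pd.DataFrame([ x.split('=') for x in df['data'].tolist() ], columns=['c','id','prod'])

#split the '162&prod' part on '&' to separate the id from the trailing 'prod'
df2 = pd.DataFrame([ x.split('&') for x in df1['id'].tolist() ], columns=['id', 'no3'])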
print(df)

                 date   get                     data       http  no1   no2
0 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
1 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
2 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
3 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
4 2012-01-02 09:07:32  "GET  /click?id=162&prod=5475  HTTP/1.1"  200  4352
print(df1)

           c        id  prod
0  /click?id  162&prod  5475
1  /click?id  162&prod  5475
2  /click?id  162&prod  5475
3  /click?id  162&prod  5475
4  /click?id  162&prod  5475
print(df2)

    id   no3
0  162  prod
1  162  prod
2  162  prod
3  162  prod
4  162  prod
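
As an aside, the two split steps above could probably be collapsed into one with Series.str.extract and named groups. This is only a sketch under the assumption that every row carries id followed by prod, as in the sample line; the variable names data and parts are made up for illustration, and a row that does not match would yield NaN and make the astype(int) call fail.

```python
import pandas as pd

# Stand-in for df['data']; the second value is invented to show varying numbers
data = pd.Series(['/click?id=162&prod=5475', '/click?id=7&prod=12'])

# Named groups become the column names of the resulting DataFrame
parts = data.str.extract(r'id=(?P<id>\d+)&prod=(?P<prod>\d+)').astype(int)
print(parts)
```

The result already has the id and prod columns as integers, so it could be concatenated with the date column directly.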

df = pd.concat([df['date'], df1['prod'], df2['id']], axis=1)
print(df)

                 date  prod   id
0 2012-01-02 09:07:32  5475  162
1 2012-01-02 09:07:32  5475  162
2 2012-01-02 09:07:32  5475  162
3 2012-01-02 09:07:32  5475  162
4 2012-01-02 09:07:32  5475  162
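
For comparison with the loop in the question: one likely reason it stalls is that each Series.append call returns a new Series and copies everything collected so far, so the total work grows roughly quadratically with the number of lines. A minimal sketch of the same loop that collects matches in a plain list and builds the DataFrame once is shown here. LINE_RE and parse_log are hypothetical names, and the pattern is a simplified assumption based on the single sample line, not the regex from the question.

```python
import re
import pandas as pd

# Assumed, simplified pattern for lines shaped like the sample in the question
LINE_RE = re.compile(r'\[([^\]]+)\] "GET /click\?id=(\d+)&prod=(\d+)')

def parse_log(lines):
    """Collect (date, id, prod) tuples in a list, then build one DataFrame."""
    rows = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:                       # skip lines that do not match
            rows.append(m.groups())
    df = pd.DataFrame(rows, columns=['date', 'id', 'prod'])
    df['date'] = pd.to_datetime(df['date'], format='%d/%b/%Y:%H:%M:%S')
    df[['id', 'prod']] = df[['id', 'prod']].astype(int)
    return df

sample = [
    '[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352\n',
    'not a log line\n',
]
df = parse_log(sample)
print(df)
```

With the real file, something like `with open("data/access.log") as f: df = parse_log(f)` should work, since iterating over a file object already yields one line at a time.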

1 Comment

I clearly still know nothing, that was beautiful! Thank you.
