I have a log file with 1,001,623 lines formatted like this:
[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352
Each record is on its own line.
I used a regular expression to loop over the file and extract the information I need (date, id, product):
for txt in logfile:
    m = rg.search(txt)
    if m:
        l1 = m.group(1)  # date
        l2 = m.group(2)  # id
        l3 = m.group(3)  # product
        # append each value to its running Series
        dt = dt.append(pd.Series([l1]))
        art = art.append(pd.Series([l2]))
        usr = usr.append(pd.Series([l3]))
This works fine in testing, where I only used a small sample, but when I ran it on the entire set it kept running for 12 hours without showing any progress. Afterwards I want to create a DataFrame to do some analytics. Is there a better way to do this?
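Would something along these lines be a better direction, i.e. collecting the matches in plain lists and only building the pandas objects once at the end? This is just a sketch of what I have in mind (rg is the compiled pattern from the edit below, and the column names are made up):

import pandas as pd

dates, ids, prods = [], [], []

with open("data/access.log", "r") as logfile:
    for txt in logfile:          # read the file line by line
        m = rg.search(txt)       # rg = compiled pattern, see the edit below
        if m:
            dates.append(m.group(1))
            ids.append(m.group(2))
            prods.append(m.group(3))

# build the DataFrame once, after the loop, instead of appending inside it
df = pd.DataFrame({"date": dates, "id": ids, "prod": prods})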
Edit:
This is how I open the log file.
logfile = open("data/access.log", "r")
The regex:
re1='.*?' # Non-greedy match on filler
re2='((?:(?:[0-2]?\\d{1})|(?:[3][01]{1}))[-:\\/.](?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Sept|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[-:\\/.](?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3})))(?![\\d])' # DDMMMYYYY 1
re3='.*?' # Non-greedy match on filler
re4='\\d+' # Uninteresting: int
re5='.*?' # Non-greedy match on filler
re6='\\d+' # Uninteresting: int
re7='.*?' # Non-greedy match on filler
re8='\\d+' # Uninteresting: int
re9='.*?' # Non-greedy match on filler
re10='(\\d+)' # Integer Number 1
re11='.*?' # Non-greedy match on filler
re12='(\\d+)' # Integer Number 2
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12,re.IGNORECASE|re.DOTALL)
And this is the loop that applies it:

for txt in logfile.readlines():
    m = rg.search(txt)
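For what it's worth, on the single sample line the pattern does seem to pull out the three fields I'm after:

# quick check of the compiled pattern rg (defined above) on one sample line
sample = '[02/Jan/2012:09:07:32] "GET /click?id=162&prod=5475 HTTP/1.1" 200 4352'
m = rg.search(sample)
if m:
    print(m.group(1), m.group(2), m.group(3))
    # prints: 02/Jan/2012 162 5475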