
A little hesitant about posting this - as far as I'm concerned it's a genuine question, but I guess I'll understand if it's criticised or closed as being an invite for discussion...

Anyway, I need to use Python to search some quite large web logs for specific events. RegEx would be good but I'm not tied to any particular approach - I just want lines that contain two strings that could appear anywhere in a GET request.

As a typical file is over 400 MB and contains around a million lines, performance, both in terms of time to complete and load on the server (an Ubuntu/nginx VM - reasonably well spec'd and rarely overworked), is likely to be an issue.

I'm a fairly recent convert to Python (not quite a newbie, but still with plenty to learn) and I'd like a bit of guidance on the best way to achieve this.

Do I open and iterate through? Grep to a new file and then open? Some combination of the two? Something else?


1 Answer


As long as you don't read the whole file at once but iterate through it continuously, you should be fine. I don't think it really matters whether you read the whole file with Python or with grep; either way you still have to scan the whole file :). And if you take advantage of generators you can do this in a really programmer-friendly way:

import re

# Generator: yield matching rows from the log file one at a time,
# so the whole file is never held in memory.
def parse_log(filename):
    reg = re.compile('...')  # fill in your pattern here

    with open(filename, 'r') as f:
        for row in f:
            match = reg.search(row)  # search() matches anywhere in the line
            if match:
                yield match.group(1)

for i in parse_log('web.log'):
    pass  # Do whatever you need with the matched row
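
Since you just want lines containing two fixed strings that can appear anywhere in the GET request, you may not need a regex at all. Here's a minimal sketch of the same generator idea using plain substring checks; the function name and the 'foo'/'bar' values are placeholders, not anything from your logs:

import re  # only needed if you later switch back to a compiled pattern

# A minimal sketch, assuming the two strings are plain literals
# ('foo' and 'bar' are placeholders - substitute your own values).
def parse_log_two_strings(filename, first='foo', second='bar'):
    with open(filename, 'r') as f:
        for row in f:
            # Plain substring checks are usually faster than a regex
            # for fixed strings, and the order of the strings in the
            # line doesn't matter.
            if first in row and second in row:
                yield row

# Usage: iterate lazily, so only one line is held in memory at a time.
for line in parse_log_two_strings('web.log'):
    pass  # Do whatever you need with the matched line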