Efficient way to find a string based on a list

Question

I'm new to scripting and have been reading up on Python for about 6 weeks. The script below is meant to read a log file and send an alert if one of the keywords defined in srchstring is found. It works as expected and doesn't alert on strings it has previously found. However, the file it's processing is actively being written to by an application, and the script is too slow on files of around 500 MB. Under 200 MB it works fine, i.e. within 20 seconds. Could someone suggest a more efficient way to search for a string within a file based on a predefined list?

import os
srchstring = ["Shutdown", "Disconnecting", "Stopping Event Thread"]

if os.path.isfile(r"\\server\\share\\logfile.txt"):
    with open(r"\\server\\share\\logfile.txt","r") as F:
        for line in F:
            for st in srchstring:
                if st in line:
                    print line,
                    # do some slicing of string to get dd/mm/yy hh:mm:ss:ms
                    # then create a marker file called file_dd/mm/yy hh:mm:ss:ms
                    if os.path.isfile("file_dd/mm/yy hh:mm:ss:ms"): # check if a file already exists named file_dd/mm/yy hh:mm:ss:ms
                        print "string previously found- ignoring, continuing search"  # marker file exists
                    else:
                        open("file_dd/mm/yy hh:mm:ss:ms", 'a') # create file_dd/mm/yy hh:mm:ss:ms
                        print "error string found--creating marker file sending email alert"  # no marker file, create it then send email
else:
    print "file not exist"

Does this code run? What's F? I assume it's the file you are reading, but the code doesn't reflect that. Also, when you open a file to write, you don't close it. The Pythonic way to write to files is to use a context manager: with open('filename') as f: .... As for your question, I would try using a set instead of a list for srchstring. Then, for each line in the file, make a set of the words in the line (e.g. linset = set(line.split(' '))) and then use set intersection (see docs.python.org/2/library/sets.html). If the result is not empty, there's a match. I'm guessing this could speed things up.
– jorgeh, Commented Dec 10, 2015 at 7:59
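A minimal sketch of the set-intersection idea from the comment above, using hypothetical sample lines. Splitting a line yields single words, so a multi-word phrase such as "Stopping Event Thread" can never appear in the line's word set; as written, the approach only suits single-word keywords.

```python
# Keywords from the question, stored as a set for fast membership tests.
keywords = {"Shutdown", "Disconnecting", "Stopping Event Thread"}

def has_keyword(line):
    """Return True if any whitespace-separated word in line is a keyword."""
    return bool(keywords & set(line.split()))

# Hypothetical sample lines, not taken from the real log.
single_word_hit = has_keyword("08:00:02 Disconnecting client 7")
# The phrase is split into "Stopping", "Event", "Thread", so it is not found.
phrase_hit = has_keyword("08:00:03 Stopping Event Thread")
```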
How do you know that the overhead is coming from the in search? It could be that you are reading the whole file into memory, but you don't show that code. Where does F come from?
– cdarke, Commented Dec 10, 2015 at 8:00
Apologies, I missed a line out when I was editing the post for correct formatting. I've updated it now. I read somewhere that a nested 'if' may not be the best way, but I can't find the post that suggested that. This is what led me to believe the 'if' may be the bottleneck.
– toon, Commented Dec 10, 2015 at 8:23
I used 'with open..' as I understand this handles the closing of the file (which I accidentally omitted in the original post). I'll experiment with the linset = set(line.split(' ')) suggestion.
– toon, Commented Dec 10, 2015 at 8:26
Your code still has indentation errors. I'm guessing the with open should be indented and everything under it up until the else should be reindented correspondingly, but as code edits are discouraged, I merely point that out here. In other words, we can probably guess what you mean, but posting Python code with different indentation than you have locally is extremely bad form.
– tripleee, Commented Dec 10, 2015 at 8:30

tripleee · Answer · 2015-12-10 08:38:59Z

1

Refactoring the search expression to a precompiled regular expression avoids the (explicit) innermost loop.

import os, re
regex = re.compile(r'Shutdown|Disconnecting|Stopping Event Thread')

if os.path.isfile(r"\\server\\share\\logfile.txt"):
    #Indentation fixed as per comment
    with open(r"\\server\\share\\logfile.txt","r") as F:
        for line in F:
            if regex.search(line):
                # ...
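
If the keywords live in a list, as in the question, the pattern can also be built from that list instead of being typed out by hand. The following is a minimal sketch under that assumption; the helper name matching_lines and the sample lines are hypothetical, and re.escape is used so that any regex metacharacters in a keyword are matched literally.

```python
import re

# Keywords from the question; any list of literal strings would work.
srchstring = ["Shutdown", "Disconnecting", "Stopping Event Thread"]

# Join the escaped keywords into one alternation and compile it once.
regex = re.compile("|".join(re.escape(s) for s in srchstring))

def matching_lines(lines):
    """Yield only the lines that contain at least one keyword."""
    for line in lines:
        if regex.search(line):
            yield line

# Hypothetical sample lines standing in for the real log file.
sample = [
    "10/12/15 08:00:01:123 Service started\n",
    "10/12/15 08:00:02:456 Disconnecting client 7\n",
    "10/12/15 08:00:03:789 Stopping Event Thread\n",
]
hits = list(matching_lines(sample))
```

Because matching_lines accepts any iterable of lines, an open file object could be passed in place of the sample list.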

answered Dec 10, 2015 at 8:38

tripleee


2 Comments

toon Over a year ago

using regex increased the speed of the search >10 I estimate - thanks

toon Over a year ago

Actually, I may have been a bit hasty: running it while the file is being written to by the application returns only slightly quicker results. When the file is not being written to, the regex is faster, so thanks.

u354356007 · Accepted Answer · 2015-12-10 08:48:39Z

0

I assume here that you use Linux. If you don't, install MinGW on Windows and the solution below will become suitable too.

Just leave the hard part to the most efficient tools available. Filter your data before it reaches the Python script. Use the grep command to get the lines containing "Shutdown", "Disconnecting" or "Stopping Event Thread"

grep 'Shutdown\|Disconnecting\|Stopping Event Thread' /server/share/logfile.txt

and redirect the lines to your script

grep 'Shutdown\|Disconnecting\|Stopping Event Thread' /server/share/logfile.txt | python log.py

Edit: Windows solution. You can create a .bat file to make it executable.

findstr /c:"Shutdown" /c:"Disconnecting" /c:"Stopping Event Thread" \\server\share\logfile.txt | python log.py

In 'log.py', read from stdin. It's a file-like object, so there are no difficulties here:

import sys

for line in sys.stdin:
    print line,
    # do some slicing of string to get dd/mm/yy hh:mm:ss:ms
    # then create a marker file called file_dd/mm/yy hh:mm:ss:ms 
    # and so on

This solution will reduce the amount of work your script has to do. As Python isn't a fast language, it may speed up the task. I suspect it could be rewritten purely in bash and would be even faster (a C program with 20+ years of optimization behind it is not something you compete with easily), but I don't know bash well enough.
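
One possible shape for the rest of log.py, as a sketch only. It assumes each log line starts with a timestamp in the dd/mm/yy hh:mm:ss:ms form mentioned in the question's comments; the names TIMESTAMP, marker_name, is_new, process and main are hypothetical. Since "/" and ":" are not safe in file names, the sketch replaces them with "-" in the marker name.

```python
import os
import re
import sys

# Assumption: each log line begins with a "dd/mm/yy hh:mm:ss:ms" timestamp.
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}:\d+)")

def marker_name(line):
    """Return a file-name-safe marker name for line, or None if no timestamp."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    # Replace "/", ":" and spaces, which are awkward or invalid in file names.
    return "file_" + re.sub(r"[/: ]", "-", match.group(1))

def is_new(line, marker_dir):
    """Create a marker file for line; return True only the first time it is seen."""
    name = marker_name(line)
    if name is None:
        return False
    path = os.path.join(marker_dir, name)
    if os.path.isfile(path):
        return False        # marker exists: string previously found
    with open(path, "a"):   # the with block closes the file immediately
        pass
    return True

def process(lines, marker_dir):
    """Return the lines that have not been alerted on before."""
    return [line for line in lines if is_new(line, marker_dir)]

def main():
    # Lines arrive pre-filtered from grep or findstr on stdin.
    for line in process(sys.stdin, os.getcwd()):
        sys.stdout.write(line)  # an email alert would be sent here instead
```

Calling main() at the bottom of log.py would wire this to the pipeline above. Two matching lines that share the same timestamp map to the same marker file, so only the first of them would trigger an alert.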

edited Dec 10, 2015 at 8:48

answered Dec 10, 2015 at 8:27

u354356007


5 Comments

u354356007 Over a year ago

@toon Anyway, the approach is still valid. grep has been ported to Windows several times (I suggest getting it by installing the MinGW package), and Windows has its own tools, find and findstr. I'll update the post shortly with a Windows equivalent if I find a suitable one.

tripleee Over a year ago

Why are you saying Python isn't fast? For a scripting language, it's reasonably well optimized for reading input one line at a time. The remaining Python script is rather silly, though.

toon Over a year ago

Why is it silly? The finding part isn't that hard, but re-reading the file and ignoring a string that has previously been found can be tricky for a beginner.

u354356007 Over a year ago

@tripleee A Python script is probably slower than a specialized tool for searching strings in text streams; that's what I wanted to say.

toon Over a year ago

@Vovanrock2002 Your method is by far the quickest, so I marked this as the correct answer. All feedback has been a great learning experience. You've got to start somewhere, right? :)
