
I am writing a script in Python 2.6 (I am very new to Python). What I am trying to achieve is the most efficient way of doing the following:

  • scan through about 300,000 .bin files
  • each file is between 500mb and 900mb
  • pull 2 strings located in each file (they are both located towards the beginning of the file)
  • put the output from each file into one .txt file

I wrote the following script, which works, but it processes each file INCREDIBLY slowly. It processed about 118 files in the past 50 minutes or so:

 import re, os, codecs

 path = "./"  # will search current directory
 dir_lib = os.listdir(path)

 for book in dir_lib:
     if not book.endswith('.bin'):  # only looks for files that have .bin extension
         continue
     file = os.path.join(path, book)
     text = codecs.open(file, "r", "utf-8", errors="ignore")

     # had to use "ignore" because I kept getting an error with binary files:
     # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 10:
     # unexpected code byte

     for lineout in text:
         w = re.search("(Keyword1\:)\s(\[(.+?)\])", lineout)
         d = re.search("Keyword2\s(\[(.+?)\])", lineout)

         outputfile = open('output.txt', 'w')

         if w:
             lineout = w.group(3)  # first keyword that is between the [ ]
             outputfile.write(lineout + ",")
         elif d:
             lineout = d.group(2)  # second keyword that is between the [ ]
             outputfile.write(lineout + ";")

         outputfile.close()
     text.close()

My output comes out fine and exactly how I want it:

 keyword1,keyword2;keyword1,keyword2;etc,...; 

but with this speed it will take about a month or so of continuous running. Is there anything else I could try, maybe an alternative to regex? Is there a way to stop scanning the whole file and move on to the next one once both keywords are found?

Thank you for your suggestions.

2 Comments

  • I tried it with and without closing, and the speed has not changed, unfortunately.
  • If you open output.txt for writing each time you find the target text, you'll be overwriting the logfile each time. You should either open the file for appending, or (even better) leave the file handle open for the duration of the search.
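A minimal sketch of the second suggestion in that comment: open the output file once, before the loop, and hold the handle for the whole run. (`collect` and its arguments are made-up names for illustration, not from the original code.)

```python
# Sketch of the comment's suggestion: open the output file once, before the
# loop, instead of reopening (and truncating with 'w') it for every match.
def collect(matches, outpath):
    with open(outpath, "w") as outputfile:  # one handle for the whole run
        for m in matches:
            outputfile.write(m + ",")
```

Opening with mode `'a'` inside the loop would also avoid the truncation, but keeping one handle open avoids the repeated open/close system calls as well.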

3 Answers


One way is to cheat and imitate grep from Unix; try http://nedbatchelder.com/code/utilities/pygrep.py

import os

# Get the pygrep script.
if not os.path.exists('pygrep.py'):
    os.system("wget http://nedbatchelder.com/code/utilities/pygrep.py")
from pygrep import grep, Options

# Writes a test file.
text="""This is a text
somehow there are many foo bar in the world.
sometimes they are black sheep, 
sometimes they bar bar black sheep.
most times they foo foo here
and a foo foo there"""
with open('test.txt','w') as fout:
    fout.write(text)

# Here comes the query
queries = ['foo','bar']

opt = Options()  # set options for grep.
with open('test.txt','r') as fin:
    for i in queries:
        fin.seek(0)  # rewind so each query scans from the start of the file
        grep(i, fin, opt)

1 Comment

Or, OP, just use grep in the first place.
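If real grep is available, one hedged way to use it from Python is to shell out with `-m 1` so grep stops reading each file at the first match. (`grep_first` is a hypothetical helper, not part of the original answer.)

```python
import subprocess

def grep_first(pattern, fname):
    # -m 1 stops grep after the first matching line, -o prints only the
    # matched text, so each file is read only as far as the first hit.
    proc = subprocess.Popen(["grep", "-m", "1", "-o", pattern, fname],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out.decode("utf-8", "ignore").strip()
```

For files where the keywords sit near the beginning, stopping at the first match is where most of the speedup would come from.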

You can improve your code in at least three ways (in descending order of importance):

  • You don't break out of the inner for loop when both lines are found. This means that the script will iterate over the entire file, even though the two lines are found somewhere near the beginning of the file.
  • If the regexp pattern is identical for all files, you should compile the regexp outside of your outer for loop. If they change from file to file, put them outside your inner for loop. As it stands now, a new regexp object is created at each iteration.

Note: This may not matter much in practice, since the most recently used patterns are cached (but there is no good reason not to do it).

  • Additionally, you shouldn't open and close the output file at each iteration.

The code below addresses these issues:

import re, os, codecs

path = "./"
dir_lib = os.listdir(path)
w_pattern = re.compile("(Keyword1\:)\s(\[(.+?)\])")
d_pattern = re.compile("Keyword2\s(\[(.+?)\])")

with open('output.txt', 'w') as outputfile:
    for book in dir_lib:
        if not book.endswith('.bin'):
            continue
        filename = os.path.join(path, book)
        with codecs.open(filename, "r", "utf-8", errors="ignore") as text:
            w_found, d_found = False, False
            for lineout in text:
                w = w_pattern.search(lineout)
                d = d_pattern.search(lineout)
                if w:
                    lineout = w.group(3)
                    outputfile.write(lineout + ",")
                    w_found = True
                elif d:
                    lineout = d.group(2)
                    outputfile.write(lineout + ";")
                    d_found = True
                if w_found and d_found:
                    break

4 Comments

Correct. If you do not want to use break you can just use a while loop with while not (w_found and d_found):
@user2100799 hard to combine a while loop with iteration through a for loop.
@jmu303 Not really, both methods are appropriate here. You just have to call lineout = next(text) and break out of the loop if StopIteration is raised.
@SteinarLima fair enough.
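A rough sketch of the `next()`-based variant described in the comment above (the name `scan_until_both` and the substring checks are made up for illustration; the real code would use the compiled patterns):

```python
def scan_until_both(lines):
    # Drive the iterator manually with next(), as the comment suggests,
    # stopping once both keywords are seen or the input is exhausted.
    w_found, d_found = False, False
    hits = []
    while not (w_found and d_found):
        try:
            line = next(lines)
        except StopIteration:  # ran out of lines before finding both
            break
        if not w_found and "Keyword1" in line:
            hits.append(line.strip())
            w_found = True
        elif not d_found and "Keyword2" in line:
            hits.append(line.strip())
            d_found = True
    return hits
```

Functionally this is equivalent to the for loop with break; which reads better is a matter of taste.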

A few simplifications which may or may not be applicable:

  • I assume that Keyword1 and Keyword2 both occur at the start of a line (so I can use re.match instead of re.search)
  • I assume that Keyword1 will always occur before Keyword2 (so I can search for one, then the other: half as many regex calls per line):

and so:

import codecs
import glob
import re

START = re.compile("Keyword1\:\s\[(.+?)\]").match
END   = re.compile("Keyword2\s\[(.+?)\]").match

def main():
    with open('output.txt', 'w') as outf:
        for fname in glob.glob('*.bin'):
            with codecs.open(fname, 'rb', 'utf-8', errors='ignore') as inf:
                w = None
                for line in inf:
                    w = START(line)
                    if w:
                        break

                d = None
                for line in inf:
                    d = END(line)
                    if d:
                        break

                if w and d:
                    outf.write('{0},{1};'.format(w.group(1), d.group(1)))

if __name__=="__main__":
    main()

3 Comments

@SteinarLima: is English not your first language? I quite clearly said that "these assumptions may not be applicable". However, if they are applicable I would expect a speed-up of around 2.5x relative to your code above.
I'm sorry, I didn't read your answer thoroughly enough. But you should state that you expect a speedup given these assumptions, otherwise it is not clear why one would solve the problem this way. Just edit your answer, and I'll remove the downvote.
But: Isn't it a bit childish to downvote my question just because you got a downvote from me? It seems immature imo. (sorry if you're not the downvoter)
