
I am writing a script in Python 2.6 (I am very new to Python). What I am trying to achieve is the most efficient way of doing the following:

  • scan through about 300,000 .bin files
  • each file is between 500mb and 900mb
  • pull 2 strings located in each file (they are both located towards the beginning of the file)
  • put the output from each file into one .txt file

I wrote the following script, which works, but it processes each file INCREDIBLY slowly. It processed about 118 files in the past 50 minutes or so:

 import re, os, codecs

 path = "./"  # will search current directory
 dir_lib = os.listdir(path)

 for book in dir_lib:
     if not book.endswith('.bin'):  # only looks for files that have .bin extension
         continue
     file = os.path.join(path, book)
     text = codecs.open(file, "r", "utf-8", errors="ignore")

     # had to use "ignore" because I kept getting an error with binary files:
     # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 10:
     # unexpected code byte

     for lineout in text:
         w = re.search("(Keyword1\:)\s(\[(.+?)\])", lineout)
         d = re.search("Keyword2\s(\[(.+?)\])", lineout)

         outputfile = open('output.txt', 'w')

         if w:
             lineout = w.group(3)  # first keyword that is between the [ ]
             outputfile.write(lineout + ",")
         elif d:
             lineout = d.group(2)  # second keyword that is between the [ ]
             outputfile.write(lineout + ";")

         outputfile.close()
     text.close()

My output comes out fine and exactly how I want it:

 keyword1,keyword2;keyword1,keyword2;etc,...; 

but with this speed it will take about a month or so of continuous running. Is there anything else I could try, maybe an alternative to regex? Is there a way to stop scanning the whole file and move on to the next one once both keywords are found?

Thank you for your suggestions.

2 Comments

  • I tried it with and without closing, and the speed has not changed, unfortunately.
  • If you open output.txt for writing each time you find the target text, you'll be overwriting the logfile each time. You should either open the file for appending, or (even better) leave the file handle open for the duration of the search.
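A minimal sketch of the second suggestion in that comment: open the output file once, before the loop, and hold the handle for the whole run. (`collect` and its arguments are made-up names for illustration, not from the original code.)

```python
# Sketch of the comment's suggestion: open the output file once, before the
# loop, instead of reopening (and truncating with 'w') it for every match.
def collect(matches, outpath):
    with open(outpath, "w") as outputfile:  # one handle for the whole run
        for m in matches:
            outputfile.write(m + ",")
```

Opening with mode `'a'` inside the loop would also avoid the truncation, but keeping one handle open avoids the repeated open/close system calls as well.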

3 Answers


One way is to cheat and imitate grep from Unix; try http://nedbatchelder.com/code/utilities/pygrep.py

import os

# Get the pygrep script.
if not os.path.exists('pygrep.py'):
    os.system("wget http://nedbatchelder.com/code/utilities/pygrep.py")
from pygrep import grep, Options

# Writes a test file.
text="""This is a text
somehow there are many foo bar in the world.
sometimes they are black sheep, 
sometimes they bar bar black sheep.
most times they foo foo here
and a foo foo there"""
with open('test.txt','w') as fout:
    fout.write(text)

# Here comes the query
queries = ['foo','bar']

opt = Options()  # set options for grep.
with open('test.txt','r') as fin:
    for i in queries:
        fin.seek(0)  # rewind so each query scans from the start of the file
        grep(i, fin, opt)

1 Comment

Or, OP, just use grep in the first place.
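If real grep is available, one hedged way to use it from Python is to shell out with `-m 1` so grep stops reading each file at the first match. (`grep_first` is a hypothetical helper, not part of the original answer.)

```python
import subprocess

def grep_first(pattern, fname):
    # -m 1 stops grep after the first matching line, -o prints only the
    # matched text, so each file is read only as far as the first hit.
    proc = subprocess.Popen(["grep", "-m", "1", "-o", pattern, fname],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out.decode("utf-8", "ignore").strip()
```

For files where the keywords sit near the beginning, stopping at the first match is where most of the speedup would come from.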

You can improve your code in at least three ways (in descending order of importance):

  • You don't break out of the inner for loop when both lines are found. This means that the script will iterate over the entire file, even though the two lines are found somewhere near the beginning of the file.
  • If the regexp pattern is identical for all files, you should compile the regexp outside of your outer for loop. If they change from file to file, put them outside your inner for loop. As it stands now, a new regexp object is created at each iteration.

Note: This may not matter much in practice, since the most recently used patterns are cached (but there is no good reason not to do it).

  • Additionally, you shouldn't open and close the output file at each iteration.

The code below addresses these issues:

import re, os, codecs

path = "./"
dir_lib = os.listdir(path)
w_pattern = re.compile("(Keyword1\:)\s(\[(.+?)\])")
d_pattern = re.compile("Keyword2\s(\[(.+?)\])")

with open('output.txt', 'w') as outputfile:
    for book in dir_lib:
        if not book.endswith('.bin'):
            continue
        filename = os.path.join(path, book)
        with codecs.open(filename, "r", "utf-8", errors="ignore") as text:
            w_found, d_found = False, False
            for lineout in text:
                w = w_pattern.search(lineout)
                d = d_pattern.search(lineout)
                if w:
                    lineout = w.group(3)
                    outputfile.write(lineout + ",")
                    w_found = True
                elif d:
                    lineout = d.group(2)
                    outputfile.write(lineout + ";")
                    d_found = True
                if w_found and d_found:
                    break

4 Comments

Correct. If you do not want to use break you can just use a while loop with while not (w_found and d_found):
@user2100799 hard to combine a while loop with iteration through a for loop.
@jmu303 Not really, both methods are appropriate here. You just have to call lineout = next(text) and break out of the loop if StopIteration is raised.
@SteinarLima fair enough.
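A rough sketch of the `next()`-based variant described in the comment above (the name `scan_until_both` and the substring checks are made up for illustration; the real code would use the compiled patterns):

```python
def scan_until_both(lines):
    # Drive the iterator manually with next(), as the comment suggests,
    # stopping once both keywords are seen or the input is exhausted.
    w_found, d_found = False, False
    hits = []
    while not (w_found and d_found):
        try:
            line = next(lines)
        except StopIteration:  # ran out of lines before finding both
            break
        if not w_found and "Keyword1" in line:
            hits.append(line.strip())
            w_found = True
        elif not d_found and "Keyword2" in line:
            hits.append(line.strip())
            d_found = True
    return hits
```

Functionally this is equivalent to the for loop with break; which reads better is a matter of taste.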

A few simplifications which may or may not be applicable:

  • I assume that Keyword1 and Keyword2 both occur at the start of a line (so I can use re.match instead of re.search)
  • I assume that Keyword1 will always occur before Keyword2 (so I can search for one, then the other: half as many regex calls per line):

and so:

import codecs
import glob
import re

START = re.compile("Keyword1\:\s\[(.+?)\]").match
END   = re.compile("Keyword2\s\[(.+?)\]").match

def main():
    with open('output.txt', 'w') as outf:
        for fname in glob.glob('*.bin'):
            with codecs.open(fname, 'rb', 'utf-8', errors='ignore') as inf:
                w = None
                for line in inf:
                    w = START(line)
                    if w:
                        break

                d = None
                for line in inf:
                    d = END(line)
                    if d:
                        break

                if w and d:
                    outf.write('{0},{1};'.format(w.group(1), d.group(1)))

if __name__=="__main__":
    main()

3 Comments

@SteinarLima: is English not your first language? I quite clearly said that "these assumptions may not be applicable". However, if they are applicable I would expect a speed-up of around 2.5x relative to your code above.
I'm sorry, I didn't read your answer thoroughly enough. But you should state that you expect a speedup given these assumptions, otherwise it is not clear why one would solve the problem this way. Just edit your answer, and I'll remove the downvote.
But: Isn't it a bit childish to downvote my question just because you got a downvote from me? It seems immature imo. (sorry if you're not the downvoter)
