I am writing a script in Python 2.6 (I am very new to Python). What I am trying to achieve is the most efficient way of doing the following:
- scan through about 300,000 .bin files
- each file is between 500mb and 900mb
- pull 2 strings located in each file (they are both located towards the beginning of the file)
- put the output from each file into one .txt file
I wrote the following script, which works, but it processes each file INCREDIBLY slowly. It has processed about 118 files in the past 50 minutes or so:
import re, os, codecs

path = "./" #will search current directory
dir_lib = os.listdir(path)
for book in dir_lib:
    if not book.endswith('.bin'): #only looks for files that have .bin extension
        continue
    file = os.path.join(path, book)
    text = codecs.open(file, "r", "utf-8", errors="ignore")
    #had to use "ignore" because I kept getting an error with binary files:
    #UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 10:
    #unexpected code byte
    for lineout in text:
        w = re.search("(Keyword1\:)\s(\[(.+?)\])", lineout)
        d = re.search("Keyword2\s(\[(.+?)\])", lineout)
        outputfile = open('output.txt', 'w')
        if w:
            lineout = w.group(3) #first keyword that is between the [ ]
            outputfile.write(lineout + ",")
        elif d:
            lineout = d.group(2) #second keyword that is between the [ ]
            outputfile.write(lineout + ";")
        outputfile.close()
    text.close()
My output comes out fine and exactly how I want it:
keyword1,keyword2;keyword1,keyword2;etc,...;
but at this speed it will take about a month of continuous running. Is there anything else I could try, maybe an alternative to regex? Or a way to stop scanning a file and move on to the next one once it has found the keywords?
Thank you for your suggestions.
Since you open output.txt for writing each time you find the target text, you'll be overwriting the logfile each time. You should either open the file for appending, or (even better) leave the file handle open for the duration of the search.
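Here is a minimal sketch of that restructuring, combined with an early exit once both strings have been found. It assumes each keyword appears at most once per file; the compiled patterns are simplified from your regexes (so group(1) is the text between the [ ]), and the found_w/found_d flags are new names introduced for illustration:

import re, os, codecs

# simplified versions of the patterns from the question;
# group(1) captures the text between the [ ]
w_pat = re.compile(r"Keyword1:\s\[(.+?)\]")
d_pat = re.compile(r"Keyword2\s\[(.+?)\]")

path = "./"
outputfile = open('output.txt', 'w')  # opened once, before the loop
for book in os.listdir(path):
    if not book.endswith('.bin'):
        continue
    text = codecs.open(os.path.join(path, book), "r", "utf-8",
                       errors="ignore")
    found_w = found_d = False
    for lineout in text:
        if not found_w:
            w = w_pat.search(lineout)
            if w:
                outputfile.write(w.group(1) + ",")
                found_w = True
        if not found_d:
            d = d_pat.search(lineout)
            if d:
                outputfile.write(d.group(1) + ";")
                found_d = True
        if found_w and found_d:
            break  # both strings found: skip the rest of this file
    text.close()
outputfile.close()

Because the break fires as soon as both matches have been written, and you say both strings sit towards the beginning of each file, each file is only read and decoded up to the second match rather than the full 500-900 MB.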