So lately I've been writing a Python script for extracting data from large text files (> 1 GB). The problem basically boils down to selecting lines of text from the file and searching each of them for strings from an array (this array can have as many as 1000 strings in it). The catch is that I have to find a specific occurrence of each string, and a string may appear an unlimited number of times in the file. Some decoding and encoding is also required, which slows the script down further. The code looks something like this:
strings = [a for a in open('file.txt')]   # the ~1000 search strings

with open("er.txt", "r") as f:
    for chunk in f:          # iterate over the big file line by line
        for s in strings:
            # do search, trimming, stripping ...
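To give a clearer picture of that inner part, it does roughly the following (a simplified sketch, not my exact code; the real trimming logic and the encoding handling are messier):

import io

strings = [line.strip() for line in open('file.txt', encoding='utf-8')]

with open('er.txt', 'rb') as f:
    for chunk in f:
        line = chunk.decode('utf-8', errors='ignore')   # the decoding step mentioned above
        for s in strings:
            pos = line.find(s)
            if pos == -1:
                continue
            # trim/strip around the match and record it
            hit = line[pos:].strip()
            break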
My question here is:
Is there a way to optimize this? I tried multiprocessing, but it helped little (or at least the way I implemented it). The problem is that these chunk operations aren't independent, and the strings list may be altered during one of them.
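For reference, my multiprocessing attempt was roughly along these lines (a stripped-down sketch, assuming the file is split into line batches up front; the shared strings list is exactly what breaks this):

import multiprocessing as mp

def search_batch(args):
    lines, strings = args
    hits = []
    for line in lines:
        for s in strings:
            if s in line:
                hits.append((s, line.strip()))
                break
    return hits

if __name__ == '__main__':
    strings = [line.strip() for line in open('file.txt', encoding='utf-8')]
    with open('er.txt', 'r', encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()

    batch_size = 100000
    batches = [(lines[i:i + batch_size], strings)
               for i in range(0, len(lines), batch_size)]

    with mp.Pool() as pool:
        results = pool.map(search_batch, batches)
    # each worker gets its own copy of strings, so a change made while
    # processing one batch is never seen by the others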
Any optimization would help (string search algorithms, file reading, etc.). I did as much as I could with loop breaking, but it still runs pretty slow.
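One idea I've been wondering about on the string-search side is collapsing all the search strings into a single compiled regex instead of looping over ~1000 strings per line. I haven't benchmarked it, so this is just an untested sketch:

import re

strings = [line.strip() for line in open('file.txt', encoding='utf-8')]

# one alternation pattern; re.escape keeps any special characters literal
pattern = re.compile('|'.join(re.escape(s) for s in strings))

with open('er.txt', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        m = pattern.search(line)
        if m:
            # m.group() tells me which string matched, m.start() where
            hit = line[m.start():].strip()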