Let's say I have a directory with an unknown number of text files. I want to check how many words from a certain set appear in each of them. Since those files can be huge, I was wondering what would be the most efficient way to do this in Python. This classic approach does not look ideal:
```python
for file in files:
    count = 0
    with open(file) as f:
        content = f.read()    # read() gives one string; readlines() gives a list of lines
    for word in words:
        if word in content:   # substring check over the whole file contents
            count += 1
```
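One idea I had for the memory side is to stream each file line by line, so only one line is held in memory at a time. This is just a sketch that keeps the same substring-matching semantics (`files` and `words` stand in for my actual inputs):

```python
for file in files:
    seen = set()                 # words from the set found so far in this file
    with open(file) as f:
        for line in f:           # file objects yield lines lazily, one at a time
            for word in words:
                if word not in seen and word in line:
                    seen.add(word)
    count = len(seen)            # how many of the target words occur in this file
```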
My questions are:
- How should I handle large files in memory?
- The complexity of this is O(n*m), where n = number of files and m = number of words. Is it possible to reduce this? Or is there any data structure that could help me (see the sketch below)?
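For the data-structure side, the direction I was wondering about is tokenizing each line and testing tokens against a set, since set membership is O(1) on average. A rough sketch, assuming whitespace tokenization and whole-word matches are acceptable:

```python
word_set = set(words)                # O(1) average-case membership tests

for file in files:
    found = set()
    with open(file) as f:
        for line in f:
            # keep only the tokens that belong to the target set
            found.update(t for t in line.split() if t in word_set)
    count = len(found)               # distinct target words present in this file
```

If something like this is sound, each file is scanned only once, so the work is proportional to the total number of tokens rather than to #files × #words × file size. But I'd like to know whether there is a better approach.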