Let's say I have a directory with an undefined number of text files. I want to check how many words from a certain set appear in each of them. Since those files can be huge, I was wondering what the most efficient way to do this in Python would be. This classic approach does not look ideal:

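count = 0
for file in files:
    with open(file) as f:
        content = f.readlines()  # reads the whole file into a list of lines
        for word in words:
            # a word counts once per file if any line contains it
            if any(word in line for line in content):
                count += 1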

My questions are:

  1. How should I handle large files in memory?
  2. The complexity of this is O(n*m), where n = # files and m = # words. Is it possible to reduce this? Or is there any data structure that could help me?

1 Answer

First step would be to not use readlines(): it loads the contents of the whole file into memory at once, so, time complexity aside, memory use grows linearly with the size of the file. You can reduce it by using readline() instead (or by iterating over the file object directly), reading line by line until EOF, so that only one line is held in memory at a time.
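
As a rough sketch of the line-by-line idea, the function below checks which words from a set occur in one file without ever holding the whole file in memory. The name `words_present` is hypothetical, and the sketch assumes UTF-8 text and that words are separated by whitespace (punctuation and case are not handled).

```python
def words_present(path, targets):
    """Return the subset of `targets` that occurs in the file at `path`."""
    remaining = set(targets)  # words we have not seen yet
    found = set()
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file object yields one line at a time
            hits = remaining.intersection(line.split())
            found |= hits
            remaining -= hits
            if not remaining:  # every target already seen, stop reading early
                break
    return found
```

Calling `len(words_present(path, words))` for each file would give the per-file number the question asks about, and the early exit means a huge file is not read to the end once every word has been found.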

Time-wise, you're looking for a dict of some sort, probably collections.Counter. It allows O(1) average-case lookup for the words already encountered, so you don't have to scan the list of words for every match.
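
One possible way to put the Counter suggestion together with line-by-line reading is sketched below. The helper names `count_target_words` and `count_in_directory` and the `*.txt` pattern are assumptions for illustration, as are UTF-8 encoding and whitespace tokenization.

```python
from collections import Counter
from pathlib import Path


def count_target_words(path, targets):
    """Count how often each word in `targets` occurs in one file."""
    targets = frozenset(targets)  # hash-based membership test, O(1) on average
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:  # stream the file instead of loading it whole
            counts.update(w for w in line.split() if w in targets)
    return counts


def count_in_directory(directory, targets, pattern="*.txt"):
    """Map each matching file name in `directory` to its Counter."""
    return {
        p.name: count_target_words(p, targets)
        for p in sorted(Path(directory).glob(pattern))
    }
```

With this layout each word in a file costs one hash lookup, so the running time depends on the total amount of text rather than on the number of target words. `len(counts)` gives the number of distinct target words found in a file, and `sum(counts.values())` the total number of occurrences.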

2 Comments

Yes, you are right regarding the memory complexity, but using readline will also create a lot of reads. I guess that using a buffer that can store more than just a line would be better (or maybe readline does this by itself). But I am not following what you are trying to say with respect to the time.
readlines() calls readline() repeatedly, so it's equivalent in terms of reads. It's kinda like the difference between a list comprehension and a generator expression: the end result is the same, but you're doing it either all in one go or piecemeal. Time: dicts allow you to avoid iterating over the list of words you already counted to find a match to increment; it's a hashmap.
