Best way to handle files in memory Python

Question

Let's say I have a directory with an undefined number of text files. I want to check how many words from a certain set appear in each of them. Since those files can be huge, I was wondering what the most efficient way to do this in Python would be. This classic approach does not look ideal:

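count = 0
for file in files:
    with open(file) as f:
        content = f.readlines()  # loads every line of the file into memory
        for word in words:
            if word in content:  # compares word against whole lines, newline included
                count += 1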

My questions are:

How should I handle large files in memory?
The complexity of this is O(n*m), where n = number of files and m = number of words. Is it possible to reduce this? Or is there a data structure that could help me?

jkm · Accepted Answer · 2017-11-30 15:42:17Z

3

The first step would be to not use readlines(): it loads the contents of the whole file into memory at once, so, time complexity aside, memory use grows with the size of the file. You can reduce it by using readline() instead, reading line by line until EOF, so that only the current line is held in memory.
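A minimal sketch of the line-by-line idea, assuming UTF-8 text files and whitespace-separated words. The name iter_words is made up for illustration. Iterating over the file object yields one line at a time, which has the same effect as calling readline() in a loop.

```python
def iter_words(path):
    """Yield whitespace-separated words from a text file, one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:               # the file object is a lazy iterator over lines
            yield from line.split()  # split() with no argument splits on any whitespace
```

Because this is a generator, only the current line and its words are alive at any moment, no matter how large the file is (as long as individual lines are of reasonable length).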

Time-wise, you're looking for a dict of some sort, probably collections.Counter. It allows O(1) average-time lookup of the words already encountered.
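One way this could look, as a sketch rather than a definitive implementation: the target words go into a set, and each file gets its own collections.Counter. The function name count_target_words, the UTF-8 encoding, and the whitespace tokenization (no case folding, no punctuation stripping) are assumptions made for the example.

```python
from collections import Counter


def count_target_words(paths, targets):
    """Return {path: Counter} with the occurrences of each target word per file."""
    targets = set(targets)  # set membership tests are O(1) on average
    results = {}
    for path in paths:
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:  # stream the file line by line
                # Count only the words that belong to the target set.
                counts.update(w for w in line.split() if w in targets)
        results[path] = counts
    return results
```

With this layout each word in each file is looked at once, so the work grows with the total number of words read instead of being multiplied by the size of the target set. The number of distinct target words found in a file is len(results[path]), and the total number of hits is sum(results[path].values()).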


2 Comments

m33n Over a year ago

Yes, you are right regarding the memory complexity, but using readline() will also create a lot of reads. I guess a buffer that can store more than one line would be better (or maybe readline() does this by itself). But I am not following what you are trying to say with respect to time.
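On the buffering point: files opened with open() are buffered by default, so readline() does not normally trigger a separate disk read per line. If explicit control over the buffer is wanted, for example because a file has extremely long lines, reading fixed-size chunks could look roughly like the sketch below. The name iter_words_chunked and the default chunk size are arbitrary choices for illustration.

```python
def iter_words_chunked(stream, chunk_size=1 << 16):
    """Yield whitespace-separated words, reading chunk_size characters at a time."""
    tail = ""  # possibly incomplete word left over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty string means EOF
            break
        chunk = tail + chunk
        words = chunk.split()
        if words and not chunk[-1].isspace():
            tail = words.pop()  # last word may continue in the next chunk
        else:
            tail = ""
        yield from words
    if tail:
        yield tail  # final word of a file that does not end in whitespace
```

The only subtlety is the word that straddles a chunk boundary, which is held back in tail until the next read completes it.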

jkm Over a year ago

readlines() calls readline() repeatedly, so it's equivalent. It's kind of like the difference between a list comprehension and a generator expression: the end result is the same, but you're doing it either all in one go or piecemeal. As for time: dicts let you avoid iterating over the list of words you have already counted to find the match to increment; it's a hash map.
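To illustrate the "all in one go or piecemeal" comparison, here is a small sketch that uses io.StringIO as a stand-in for a real file. The helper names read_eager and read_lazy are invented for the example.

```python
import io


def read_eager(stream):
    """Return every line at once as a list, as readlines() does."""
    return stream.readlines()


def read_lazy(stream):
    """Yield lines one at a time by calling readline() until EOF."""
    while True:
        line = stream.readline()
        if not line:  # readline() returns an empty string at EOF
            break
        yield line


sample = "alpha beta\ngamma\n"
# Both approaches produce the same lines; only the peak memory use differs.
assert read_eager(io.StringIO(sample)) == list(read_lazy(io.StringIO(sample)))
```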
