I have large log files in compressed format (e.g. largefile.gz); these are commonly 4-7 GB each.

Here's the relevant part of the code:

for filename in os.listdir(path):
    if not filename.startswith("."):
        with open(b, 'a') as newfile, gzip.GzipFile(path+filename, 'rb') as oldfile:
            # BEGIN Reads each remaining line from the log into a list
            data = oldfile.readlines()
            for line in data:
                parts = line.split()

After this, the code does some calculations (basically totaling up the bytes) and writes to a file that says "total bytes for x criteria = y". All of this works fine on a small file, but on a large file it kills the system.

What I think my program is doing is reading the whole file and storing it in data. Correct me if I'm wrong, but I think it's trying to put the whole log into memory first.

Question: how can I read one line from the compressed file, process it, then move on to the next without trying to store the whole thing in memory first? (Or is it really already doing that? I'm not sure, but based on looking at Activity Monitor my guess is that it's trying to go all in memory.)

Thanks

1 Answer

It wasn't storing the entire content in memory until you told it to. That is to say -- instead of:

# BAD: stores your whole file's decompressed contents, split into lines, in data
data = oldfile.readlines()  
for line in data:
    parts = line.split()

...use:

# GOOD: Iterates a line at a time
for line in oldfile:
    parts = line.split()

...so you aren't storing the entire file in a variable. And obviously, don't store parts anywhere that persists past the one line either.

That easy.
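
To put it in the context of your loop, a minimal sketch of the streaming version might look something like this (assuming Python 3; path and output_path stand in for your directory and output file, and the byte-totaling lines are just a placeholder for whatever your real calculation is):

import gzip
import os

# path and output_path are assumed to be defined, as in the question
total_bytes = 0
for filename in os.listdir(path):
    if filename.startswith("."):
        continue
    # 'rt' mode yields decompressed text, one line per loop iteration
    with gzip.open(os.path.join(path, filename), "rt") as oldfile:
        for line in oldfile:
            parts = line.split()
            # placeholder for the real per-line calculation; which field
            # holds the byte count depends on the log format
            if parts and parts[-1].isdigit():
                total_bytes += int(parts[-1])

with open(output_path, "a") as newfile:
    newfile.write("total bytes for x criteria = %d\n" % total_bytes)

Memory use stays roughly constant no matter how big the compressed file is, since only one decompressed line is held at a time.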

3 Comments

I think readlines is one of the worst methods Python created as far as making the "one obvious way to do it" the wrong way. People see it, and assume it's the correct way to read in lines, and never learn about file objects being iterators naturally. Most of the time, you want to just iterate the file object directly, and on the rare occasions you need it in another form, you could just use list(myfile) (or anything else that accepts an iterable and creates a data structure from it) without needing .readlines() at all.
@charles-duffy that seems to work! Is it possible to make it faster by loading say 4 gigs (or some arbitrary number/percentage) of the file into memory and then processing off of that? Would it speed things up or make a negligible difference?
@chowpay, since the compression algorithm is already working in larger chunks than a line at a time, I'd expect it to be negligible.
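
If you want to experiment with the buffering asked about in the comments anyway, one option (purely a sketch, and per the comment above the difference is likely negligible, since gzip already decompresses in chunks much larger than a line) is to put a larger read buffer in front of the gzip stream:

import gzip
import io

# Sketch only: a larger read buffer in front of the gzip stream.
# path and filename are stand-ins, as in the question; note that in
# binary mode each line comes out as a bytes object.
with gzip.open(path + filename, "rb") as raw:
    buffered = io.BufferedReader(raw, buffer_size=16 * 1024 * 1024)  # 16 MB
    for line in buffered:
        parts = line.split()

Profile it on one file before committing to it; the plain per-line iteration shown in the answer is usually already dominated by decompression time.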
