I have large log files in compressed format (e.g. largefile.gz); these are commonly 4-7 GB each.

Here's the relevant part of the code:

for filename in os.listdir(path):
    if not filename.startswith("."):
        with open(b, 'a') as newfile, gzip.GzipFile(path+filename, 'rb') as oldfile:
            # BEGIN Reads each remaining line from the log into a list
            data = oldfile.readlines()
            for line in data:
                parts = line.split()

After this, the code does some calculations (basically totaling up the bytes) and writes to a file that says "total bytes for x criteria = y". All of this works fine on a small file, but on a large file it kills the system.

What I think my program is doing is reading the whole file and storing it in data. Correct me if I'm wrong, but I think it's trying to put the whole log into memory first.

Question: how can I read one line from the compressed file, process it, then move on to the next without trying to store the whole thing in memory first? (Or is it really already doing that? I'm not sure, but based on looking at Activity Monitor my guess is that it's trying to go all in memory.)

Thanks

1 Answer

It wasn't storing the entire content in memory until you told it to. That is to say -- instead of:

# BAD: stores your whole file's decompressed contents, split into lines, in data
data = oldfile.readlines()  
for line in data:
    parts = line.split()

...use:

# GOOD: Iterates a line at a time
for line in oldfile:
    parts = line.split()

...so you aren't storing the entire file in a variable. And obviously, don't store parts anywhere that persists past the one line either.

That easy.
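
To put it in the context of your loop, a minimal sketch of the streaming version might look something like this (assuming Python 3; path and output_path stand in for your directory and output file, and the byte-totaling lines are just a placeholder for whatever your real calculation is):

import gzip
import os

# path and output_path are assumed to be defined, as in the question
total_bytes = 0
for filename in os.listdir(path):
    if filename.startswith("."):
        continue
    # 'rt' mode yields decompressed text, one line per loop iteration
    with gzip.open(os.path.join(path, filename), "rt") as oldfile:
        for line in oldfile:
            parts = line.split()
            # placeholder for the real per-line calculation; which field
            # holds the byte count depends on the log format
            if parts and parts[-1].isdigit():
                total_bytes += int(parts[-1])

with open(output_path, "a") as newfile:
    newfile.write("total bytes for x criteria = %d\n" % total_bytes)

Memory use stays roughly constant no matter how big the compressed file is, since only one decompressed line is held at a time.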

3 Comments

I think readlines is one of the worst methods Python created as far as making the "one obvious way to do it" the wrong way. People see it, and assume it's the correct way to read in lines, and never learn about file objects being iterators naturally. Most of the time, you want to just iterate the file object directly, and on the rare occasions you need it in another form, you could just use list(myfile) (or anything else that accepts an iterable and creates a data structure from it) without needing .readlines() at all.
@charles-duffy that seems to work! Is it possible to make it faster by loading say 4 gigs (or some arbitrary number/percentage) of the file into memory and then processing off of that? Would it speed things up or make a negligible difference?
@chowpay, since the compression algorithm is already working in larger chunks than a line at a time, I'd expect it to be negligible.
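
If you want to experiment with the buffering asked about in the comments anyway, one option (purely a sketch, and per the comment above the difference is likely negligible, since gzip already decompresses in chunks much larger than a line) is to put a larger read buffer in front of the gzip stream:

import gzip
import io

# Sketch only: a larger read buffer in front of the gzip stream.
# path and filename are stand-ins, as in the question; note that in
# binary mode each line comes out as a bytes object.
with gzip.open(path + filename, "rb") as raw:
    buffered = io.BufferedReader(raw, buffer_size=16 * 1024 * 1024)  # 16 MB
    for line in buffered:
        parts = line.split()

Profile it on one file before committing to it; the plain per-line iteration shown in the answer is usually already dominated by decompression time.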
