I am writing some Python code that loops through a number of files and processes the first few hundred lines of each file. I would like to extend this code so that if any of the files in the list are compressed, it will automatically decompress while reading them, so that my code always receives the decompressed lines. Essentially my code currently looks like:

for f in files:
    handle = open(f)
    process_file_contents(handle)

Is there any function that can replace open in the above code so that if f is either plain text or gzip-compressed text (or bzip2, etc.), the function will always return a file handle to the decompressed contents of the file? (No seeking required, just sequential access.)

  • That's not a duplicate. I know how to use gzip.open. I'm essentially asking if there's a function that looks at the file and automatically chooses open, gzip.open, or whatever other open function is appropriate for the compression being used, so I don't have to write a bunch of try/catch statements to try every possible open function myself. Commented Aug 21, 2013 at 21:23
  • Something like this? Commented Aug 21, 2013 at 21:56

1 Answer

I had the same problem: I'd like my code to accept filenames and return a file handle that can be used in a with statement, decompressing automatically when needed.

In my case, I'm willing to trust the filename extensions, and I only need to deal with gzip and maybe bzip2 files.

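import bz2
import gzip

def open_by_suffix(filename):
    """Open filename for reading text, picking the opener from its extension."""
    if filename.endswith('.gz'):
        # Python 3: 'rt' yields decoded text lines, like the plain open() below
        return gzip.open(filename, 'rt')
    elif filename.endswith('.bz2'):
        return bz2.open(filename, 'rt')
    else:
        return open(filename, 'r')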

If we don't trust the filenames, we can check the initial bytes of each file against known magic numbers (modified from https://stackoverflow.com/a/13044946/117714):

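import bz2
import gzip

# Map each format's leading magic bytes to (opener, mode).
magic_dict = {
    b"\x1f\x8b\x08": (gzip.open, 'rt'),
    b"\x42\x5a\x68": (bz2.open, 'rt'),
}
max_len = max(len(x) for x in magic_dict)

def open_by_magic(filename):
    # Sniff in binary mode so the raw bytes can be compared to the magic bytes.
    with open(filename, 'rb') as f:
        file_start = f.read(max_len)
    for magic, (fn, flag) in magic_dict.items():
        if file_start.startswith(magic):
            return fn(filename, flag)
    # No known magic found: treat the file as plain text.
    return open(filename, 'r')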
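
The question also mentions other formats ("bzip2, etc."), and the same magic-byte approach can be extended to them. What follows is only a minimal Python 3 sketch under a few assumptions, not part of the original answer: the name open_maybe_compressed is hypothetical, xz support is added through the standard-library lzma module, and every file is assumed to hold UTF-8 text.

```python
import bz2
import gzip
import lzma

# Leading magic bytes of each supported format, mapped to its opener.
# gzip.open, bz2.open and lzma.open all accept mode 'rt' in Python 3.3+.
MAGIC_OPENERS = {
    b"\x1f\x8b": gzip.open,        # gzip
    b"BZh": bz2.open,              # bzip2
    b"\xfd7zXZ\x00": lzma.open,    # xz
}
MAX_MAGIC_LEN = max(len(magic) for magic in MAGIC_OPENERS)


def open_maybe_compressed(filename, encoding="utf-8"):
    """Return a text-mode handle, decompressing if a known magic is found."""
    # Sniff in binary mode so raw bytes are compared, not decoded text.
    with open(filename, "rb") as raw:
        file_start = raw.read(MAX_MAGIC_LEN)
    for magic, opener in MAGIC_OPENERS.items():
        if file_start.startswith(magic):
            return opener(filename, "rt", encoding=encoding)
    # No known magic: fall back to treating the file as plain text.
    return open(filename, "r", encoding=encoding)
```

Because every branch returns a text-mode handle, the calling code receives str lines whether or not the file was compressed. A plain-text file that happens to begin with one of these byte sequences would be misdetected, so this remains a heuristic.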

Usage:

# cat
for filename in filenames:
    with open_by_suffix(filename) as f:
        for line in f:
            print(line, end='')  # each line already ends with a newline

Your use-case would look like:

for f in files:
    with open_by_suffix(f) as handle:
        process_file_contents(handle)
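
Since the question only needs the first few hundred lines of each file, the handle can be consumed lazily instead of being read in full. This is a minimal Python 3 sketch; first_lines and its default limit of 300 are hypothetical stand-ins for whatever process_file_contents actually does.

```python
from itertools import islice


def first_lines(handle, limit=300):
    """Return at most `limit` lines from an open, iterable file handle."""
    # islice stops iterating once `limit` lines have been produced,
    # so lines past that point are never requested from the handle.
    return list(islice(handle, limit))
```

This works with any of the handles returned above, because plain, gzip and bzip2 file objects all support line-by-line iteration.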