I am working with a 468 MB zip file that contains a single CSV text file. I don't want to extract the entire file, so I read the zip file one binary chunk at a time, with a chunk size of something like 65536 bytes.

I know I can read the file with Python's csv module, but in this case the chunks that I feed it will not necessarily fall on line boundaries.

How can I do this? (P.S. I do not want to have to use Pandas.)

Thanks.

  • I believe the zipfile module allows you to open a stream to a file inside the zip archive. You can then use this in csv.reader(). It won't read the entire thing into memory. Commented Apr 1, 2022 at 23:16
  • @Barmar yeah that would definitely work, except it might only be a binary stream... Commented Apr 1, 2022 at 23:18

1 Answer

You just need to do something like:

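import csv
import io
import zipfile


def do_something(row):
    # Placeholder: replace with your own per-row processing.
    print(row)


with zipfile.ZipFile("test.zip") as zipf:
    # zipf.open returns a binary stream that decompresses as it is read.
    with zipf.open("test.csv", "r") as f:
        # csv.reader needs text, so wrap the bytes in a text stream;
        # newline='' leaves line-ending handling to the csv module.
        reader = csv.reader(
            io.TextIOWrapper(f, newline='')
        )
        for row in reader:
            do_something(row)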

Assuming you have a zip archive like:

jarrivillaga$ unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
1308888890  04-01-2022 16:23   test.csv
---------                     -------
1308888890                     1 file

Note that zipf.open returns a binary stream, so you can wrap it in an io.TextIOWrapper to get a text stream, which works with both csv.reader and csv.DictReader objects.
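As a rough, self-contained illustration of the same wrapping with csv.DictReader: the sketch below builds a tiny archive in memory so it can run anywhere. The member name people.csv, its contents, and the UTF-8 encoding are assumptions made up for the example, not part of the question.

```python
import csv
import io
import zipfile

# Build a tiny archive in memory so the example is self-contained.
# The member name "people.csv" and its contents are made up for illustration.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("people.csv", "name,age\r\nAda,36\r\nLinus,28\r\n")

rows = []
with zipfile.ZipFile(archive) as zf:
    with zf.open("people.csv") as binary:
        # Decode bytes to text; newline="" leaves line-ending handling to csv.
        text = io.TextIOWrapper(binary, encoding="utf-8", newline="")
        # DictReader uses the first row as the field names.
        for record in csv.DictReader(text):
            rows.append(record)

print(rows)
```

Each record comes back as a mapping keyed by the header row, and all values are strings, so any numeric conversion is up to the caller.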

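This should read the data in reasonably sized chunks by default rather than loading the whole file. Looking at the source code, zipfile.ZipExtFile (the object returned by zipf.open) inherits from io.BufferedIOBase and decompresses as it is read, and the io.TextIOWrapper pulls from it one chunk at a time, probably something on the order of io.DEFAULT_BUFFER_SIZE.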
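One way to check the incremental behaviour is to read only a few rows and ask the binary stream how far it has advanced. This is a sketch under assumptions: the archive is generated in memory with made-up data, and it relies on tell() being available on the stream returned by open(), which in recent Python versions requires the underlying archive file to be seekable.

```python
import csv
import io
import itertools
import zipfile

# Hypothetical data: 200,000 short rows, compressed into an in-memory archive.
payload = "".join(f"{i},{i * 2}\r\n" for i in range(200_000)).encode("utf-8")
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("big.csv", payload)

with zipfile.ZipFile(archive) as zf:
    with zf.open("big.csv") as binary:
        text = io.TextIOWrapper(binary, encoding="utf-8", newline="")
        # Take only the first three rows; nothing forces the rest to be read.
        first_rows = list(itertools.islice(csv.reader(text), 3))
        # tell() reports how many uncompressed bytes have been consumed so far.
        consumed = binary.tell()

print(first_rows)
print(f"consumed {consumed} of {len(payload)} uncompressed bytes")
```

If the stream is being read lazily, the consumed count should be a small fraction of the full uncompressed size; the exact number depends on the buffer sizes used internally.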


4 Comments

The binary stream itself should be fine without a TextIOWrapper because a csv.reader (normally) wants a file opened with newline="".
Yes, this works! I had to fiddle with the encoding and delimiter, but it works. Thanks!
@martineau no, the csv reader types have expected a text stream since Python 3; they don't accept bytes
@martineau but good point about newline=''!
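For the encoding and delimiter adjustments mentioned in the comments, the two settings live in different places: the encoding is an argument to io.TextIOWrapper, and the delimiter is an argument to csv.reader. A minimal sketch, assuming a hypothetical semicolon-delimited file saved as Latin-1 (the actual encoding and delimiter of the original file are not stated):

```python
import csv
import io
import zipfile

# Hypothetical input: a semicolon-delimited file saved as Latin-1.
raw = "stadt;einwohner\r\nMünchen;1488000\r\nKöln;1084000\r\n".encode("latin-1")
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("cities.csv", raw)

with zipfile.ZipFile(archive) as zf:
    with zf.open("cities.csv") as binary:
        # The encoding belongs to the text wrapper, the delimiter to the csv reader.
        text = io.TextIOWrapper(binary, encoding="latin-1", newline="")
        rows = list(csv.reader(text, delimiter=";"))

print(rows)
```

Decoding with the wrong encoding typically shows up as a UnicodeDecodeError or as garbled accented characters, which is a hint to try a different encoding value.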
