
I have a 600MB CSV file that I load with pandas' read_csv using one of the two methods below.

import io
import pandas as pd

def read_my_csv1():
    # Parse straight from the file: pandas streams the data itself.
    df = pd.read_csv('my_data.csv')
    print(len(df))

def read_my_csv2():
    # Read the whole file into a str, then parse from an in-memory buffer.
    with open('my_data.csv') as f:
        file_contents = f.read()
    data_frame = pd.read_csv(io.StringIO(file_contents))
    print(len(data_frame))

The first method gives a peak memory usage of 1GB.

The second method gives a peak memory usage of 4GB.

I measure the peak memory usage with fil-profile.
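
For reference, Fil wraps the normal script invocation; per the Fil documentation it is run like this (my_script.py is a placeholder for whatever calls one of the functions above):

fil-profile run my_script.py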

How can the difference be so large? Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?

  • Method 2 will read the contents of the file as a single big string, which is useless except for parsing the CSV data. It will always be inefficient that way, at least for a file of this size. Why do you care about peak memory usage? Commented Apr 25, 2022 at 11:23
  • @ThomasWeller I care about peak memory usage because my computer doesn't have infinite memory. (And in my situation I need to load the CSV in a container with low RAM and no swap) Commented Apr 25, 2022 at 11:27
  • In that case, a better question to ask would be: "How to process a 600 MB CSV file on a machine which has only 2 GB RAM?" Knowing how the difference can be so large does not help you. Restricting the peak memory usage does not help you either (because I would have told you to remove RAM modules in order to reduce memory usage and increase hard disk usage) Commented Apr 25, 2022 at 12:06
  • @ThomasWeller I am asking a very specific question. I am not asking for a solution for a problem I didn't expose and that you're trying to guess. I know you're trying to be helpful but really, if I wanted to ask the question you're talking about, I would have asked it already. Commented Apr 25, 2022 at 12:15

2 Answers


How can the difference be so large?

StringIO uses a buffer of type Py_UCS4 [source]. That is a 32-bit datatype (4 bytes per character), while the CSV file is probably ASCII or UTF-8, i.e. roughly 1 byte per character. So we have an overhead of factor 3 here, accounting for an additional ~1.8 GB. Also, the StringIO buffer may overallocate by 12.5% [source].
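
You can watch the widening happen with tracemalloc; a minimal sketch (exact numbers vary with the CPython version, and the copy-on-init behavior is as of the versions current in 2022):

import io
import tracemalloc

tracemalloc.start()
text = "x" * 10_000_000   # 10 MB of ASCII text (1 byte per character)
buf = io.StringIO(text)   # copied into a Py_UCS4 buffer (4 bytes per character)
current, _ = tracemalloc.get_traced_memory()
print(f"{current / 1e6:.0f} MB")  # roughly 10 (the str) + 40 or more (the buffer)
tracemalloc.stop()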

Best case:

file_contents    600 MB
io.StringIO     2400 MB
data_frame       600 MB (at least)
DLLs, EXEs, ...    ? MB
-----------------------
                3600 MB (at least)

Case with 12.5% overallocation:

file_contents    600 MB
io.StringIO     2700 MB
data_frame       600 MB (at least)
DLLs, EXEs, ...    ? MB
-----------------------
                3900 MB (at least)

Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?

  • del the temporary objects as soon as they are no longer needed
  • Don't use StringIO; keep the data as bytes and parse from a BytesIO instead (see the sketch after this list).
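
A minimal sketch of both suggestions combined, assuming the data can stay as bytes (read_my_csv3 is a hypothetical name; read_csv accepts binary file-like objects):

import io
import pandas as pd

def read_my_csv3():
    # Keep raw bytes instead of decoding to a str: no Py_UCS4 widening.
    with open('my_data.csv', 'rb') as f:
        raw = f.read()
    # In CPython, BytesIO(raw) shares the bytes until the buffer is written to,
    # so this should add little on top of the 600 MB of raw data.
    df = pd.read_csv(io.BytesIO(raw))
    del raw  # drop the temporary copy once parsing is done
    print(len(df))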


It looks like StringIO maintains its own copy of the string data, so at least temporarily you have three copies of your data in memory — one in file_contents, one in the StringIO object, and one in the final dataframe. Meanwhile, it is at least theoretically possible for read_csv to read the input file line by line, and thereby only have one copy of the full data, in the final dataframe, when reading directly from the file.

You could try deleting file_contents after creating the StringIO object and see if that improves things.
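
A minimal sketch of that suggestion (the function name is hypothetical):

import io
import pandas as pd

def read_my_csv2_del():
    with open('my_data.csv') as f:
        file_contents = f.read()
    buffer = io.StringIO(file_contents)  # data is copied into the UCS-4 buffer here
    del file_contents                    # free the ~600 MB str before parsing
    df = pd.read_csv(buffer)
    print(len(df))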

1 Comment

Deleting file_contents does reduce peak memory usage, from 4GB to 3.4GB (i.e. by the size of the 600MB string), but that is still suspiciously high.
