
I have a 600MB CSV file that I load with pandas' read_csv using one of the two methods below.

import io
import pandas as pd

def read_my_csv1():
    # Parse straight from the file: pandas streams the data itself.
    df = pd.read_csv('my_data.csv')
    print(len(df))

def read_my_csv2():
    # Read the whole file into a str, then parse from an in-memory buffer.
    with open('my_data.csv') as f:
        file_contents = f.read()
    data_frame = pd.read_csv(io.StringIO(file_contents))
    print(len(data_frame))

The first method gives a peak memory usage of 1GB.

The second method gives a peak memory usage of 4GB.

I measure the peak memory usage with fil-profile.
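
For reference, Fil wraps the normal script invocation; per the Fil documentation it is run like this (my_script.py is a placeholder for whatever calls one of the functions above):

fil-profile run my_script.py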

How can the difference be so large? Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?

  • Method 2 will read the contents of the file as a single big string, which is useless except for parsing the CSV data. It will always be inefficient that way, at least for a file of this size. Why do you care about peak memory usage? Commented Apr 25, 2022 at 11:23
  • @ThomasWeller I care about peak memory usage because my computer doesn't have infinite memory. (And in my situation I need to load the CSV in a container with low RAM and no swap) Commented Apr 25, 2022 at 11:27
  • In that case, a better question to ask would be: "How to process a 600 MB CSV file on a machine which has only 2 GB RAM?" Knowing how the difference can be so large does not help you. Restricting the peak memory usage does not help you either (because I would have told you to remove RAM modules in order to reduce memory usage and increase hard disk usage) Commented Apr 25, 2022 at 12:06
  • @ThomasWeller I am asking a very specific question. I am not asking for a solution for a problem I didn't expose and that you're trying to guess. I know you're trying to be helpful but really, if I wanted to ask the question you're talking about, I would have asked it already. Commented Apr 25, 2022 at 12:15

2 Answers


How can the difference be so large?

StringIO uses a buffer of type Py_UCS4 [source]. That is a 32-bit datatype (4 bytes per character), while the CSV file is probably ASCII or UTF-8, i.e. roughly 1 byte per character. So we have an overhead of factor 3 here, accounting for an additional ~1.8 GB. Also, the StringIO buffer may overallocate by 12.5% [source].
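
You can watch the widening happen with tracemalloc; a minimal sketch (exact numbers vary with the CPython version, and the copy-on-init behavior is as of the versions current in 2022):

import io
import tracemalloc

tracemalloc.start()
text = "x" * 10_000_000   # 10 MB of ASCII text (1 byte per character)
buf = io.StringIO(text)   # copied into a Py_UCS4 buffer (4 bytes per character)
current, _ = tracemalloc.get_traced_memory()
print(f"{current / 1e6:.0f} MB")  # roughly 10 (the str) + 40 or more (the buffer)
tracemalloc.stop()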

Best case:

file_contents    600 MB
io.StringIO     2400 MB
data_frame       600 MB (at least)
DLLs, EXEs, ...    ? MB
-----------------------
                3600 MB (at least)

Case with 12.5% overallocation:

file_contents    600 MB
io.StringIO     2700 MB
data_frame       600 MB (at least)
DLLs, EXEs, ...    ? MB
-----------------------
                3900 MB (at least)

Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?

  • del the temporary objects as soon as they are no longer needed
  • Don't use StringIO; keep the data as bytes and parse from a BytesIO instead (see the sketch after this list).
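
A minimal sketch of both suggestions combined, assuming the data can stay as bytes (read_my_csv3 is a hypothetical name; read_csv accepts binary file-like objects):

import io
import pandas as pd

def read_my_csv3():
    # Keep raw bytes instead of decoding to a str: no Py_UCS4 widening.
    with open('my_data.csv', 'rb') as f:
        raw = f.read()
    # In CPython, BytesIO(raw) shares the bytes until the buffer is written to,
    # so this should add little on top of the 600 MB of raw data.
    df = pd.read_csv(io.BytesIO(raw))
    del raw  # drop the temporary copy once parsing is done
    print(len(df))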


It looks like StringIO maintains its own copy of the string data, so at least temporarily you have three copies of your data in memory — one in file_contents, one in the StringIO object, and one in the final dataframe. Meanwhile, it is at least theoretically possible for read_csv to read the input file line by line, and thereby only have one copy of the full data, in the final dataframe, when reading directly from the file.

You could try deleting file_contents after creating the StringIO object and see if that improves things.
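
A minimal sketch of that suggestion (the function name is hypothetical):

import io
import pandas as pd

def read_my_csv2_del():
    with open('my_data.csv') as f:
        file_contents = f.read()
    buffer = io.StringIO(file_contents)  # data is copied into the UCS-4 buffer here
    del file_contents                    # free the ~600 MB str before parsing
    df = pd.read_csv(buffer)
    print(len(df))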

1 Comment

Deleting file_contents does reduce peak memory usage, from 4GB to 3.4GB (i.e. by the size of the 600MB string), but that is still suspiciously high.
