I have a 600MB CSV file that I load with pandas' read_csv, using one of the two methods below:
import io
import pandas as pd

def read_my_csv1():
    # Let pandas read the file straight from disk.
    df = pd.read_csv('my_data.csv')
    print(len(df))

def read_my_csv2():
    # Read the whole file into a string, then parse from memory.
    with open('my_data.csv') as f:
        file_contents = f.read()
    data_frame = pd.read_csv(io.StringIO(file_contents))
    print(len(data_frame))
The first method gives a peak memory usage of 1GB.
The second method gives a peak memory usage of 4GB.
I measure the peak memory usage with fil-profile.
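(Concretely, I run something like fil-profile run read_csv_test.py, where read_csv_test.py is a script that calls one of the functions above; the script name is just a placeholder.)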
How can the difference be so large? Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?
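For what it's worth, here's a third variant I've been considering but haven't verified (the name read_my_csv3 is just mine): read the file in binary mode and wrap the raw bytes in io.BytesIO, on the theory that this skips building a 600MB Python str plus whatever copies StringIO adds on top of it.

import io
import pandas as pd

def read_my_csv3():
    # Read raw bytes instead of decoding the whole file to a str.
    with open('my_data.csv', 'rb') as f:
        file_contents = f.read()
    # BytesIO wraps the bytes; pandas accepts any file-like object
    # here, so the parser can decode as it reads.
    data_frame = pd.read_csv(io.BytesIO(file_contents))
    print(len(data_frame))

Would something along these lines be expected to help, or is the extra memory inherent to parsing from an in-memory buffer?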