
My machine became laggy while trying to read a 4 GB CSV in a Jupyter notebook with the chunksize option:

raw = pd.read_csv(csv_path, chunksize=10**6)
data = pd.concat(raw, ignore_index=True)

This takes forever to run and also freezes my machine (Ubuntu 16.04 with 16 GB of RAM). What is the right way to do this?

1 Answer


The point of using chunks is that you don't need the whole dataset in memory at once: you process each chunk as you read the file. Assuming that is true for your workload, you can do

import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    do_something(chunk)  # process each chunk here instead of keeping them all
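
For example, if the goal is to keep only a subset of the rows, you can reduce each chunk as it is read and concatenate only the reduced pieces. This is a minimal sketch; the "value" column and the filter condition are hypothetical placeholders for whatever selection you actually need:

import pandas as pd

chunksize = 10 ** 6
filtered_parts = []
row_count = 0

for chunk in pd.read_csv(csv_path, chunksize=chunksize):
    # keep only the rows you actually need from each chunk (hypothetical filter)
    filtered_parts.append(chunk[chunk["value"] > 0])
    row_count += len(chunk)

# concatenating only the filtered pieces stays much smaller than the full 4 GB
filtered = pd.concat(filtered_parts, ignore_index=True)
print(row_count, len(filtered))

If all you need are summary statistics, you can skip the concatenation entirely and just accumulate the numbers per chunk, so memory use stays bounded by the chunk size.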