
I have a file bigger than 7GB. I am trying to place it into a dataframe using pandas, like this:

df = pd.read_csv('data.csv') 

But it takes too long. Is there a better way to speed up the dataframe creation? I was considering setting the parameter engine='c', since the documentation says:

"engine{‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete."

But I don't see much gain in speed.

  • Chunk it up and then do the data analysis in parts. stackoverflow.com/questions/44729727/… Commented Jan 13, 2021 at 16:53
  • Reading CSV files is a fairly slow process. If this is a file you expect to import/output frequently, then you should pay the upfront cost of reading the CSV once and save it in a format that pandas can read much more quickly: pandas.pydata.org/pandas-docs/stable/user_guide/…. Based on their timings, .pkl files can be read nearly 50x faster than .csv files. Commented Jan 13, 2021 at 16:55
  • Do you use the same CSV many times? If so, save it in something like Parquet or Arrow after you've managed to get it into memory once. Commented Jan 13, 2021 at 16:55
  • Maybe take a look at Dask, which is very similar to pandas but supports multicore and handles large datasets. docs.dask.org/en/latest/dataframe.html Commented Jan 13, 2021 at 16:57
  • @PaulBrennan thanks, I will look into it. Seems useful. Commented Jan 13, 2021 at 16:58

1 Answer


If the problem is that you are not able to create the dataframe at all because the file's size makes the operation fail, you can check how to chunk it in this answer.

If the dataframe is eventually created but you find the process too slow, you can use datatable to read the file, convert the result to pandas, and continue with your operations:

import pandas as pd
import datatable as dt

# Read with datatable
datatable_df = dt.fread('myfile.csv')

# Then convert the frame into a pandas DataFrame
pandas_df = datatable_df.to_pandas()

3 Comments

Thanks for the answer. But how slow is the later conversion from datatable to pandas? Because if it takes too long, the whole process may not be worth it.
@TonyBalboa actually not much longer than the reading operation. It will depend on the machine you are running it on, but the entire operation of reading + converting should take just a few seconds.
Thank you, it is indeed much faster.
