
I have a file bigger than 7GB. I am trying to place it into a dataframe using pandas, like this:

df = pd.read_csv('data.csv') 

But it takes too long. Is there a better way to speed up the dataframe creation? I was considering setting the parameter engine='c', since the documentation says:

"engine{‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete."

But I don't see much gain in speed.

  • Chunk it up and then do the data analysis in parts. stackoverflow.com/questions/44729727/… Commented Jan 13, 2021 at 16:53
  • Reading CSV files is a fairly slow process. If this is a file you expect to import/output frequently, then you should pay the upfront cost of reading the CSV once and save it in a format that pandas can read much more quickly: pandas.pydata.org/pandas-docs/stable/user_guide/…. Based on their timings, .pkl files can be read nearly 50x faster than .csv files. Commented Jan 13, 2021 at 16:55
  • Do you use the same CSV many times? If so, save it in something like Parquet or Arrow after you've managed to get it into memory once. Commented Jan 13, 2021 at 16:55
  • Maybe take a look at Dask, which is very similar to pandas but supports multicore and handles large datasets. docs.dask.org/en/latest/dataframe.html Commented Jan 13, 2021 at 16:57
  • @PaulBrennan thanks, I will look into it. Seems useful. Commented Jan 13, 2021 at 16:58

1 Answer


If the problem is that you are not able to create the dataframe at all because the file's size makes the operation fail, you can check how to chunk it in this answer.

If the dataframe is eventually created but you find the process too slow, you can use datatable to read the file, convert the result to pandas, and continue with your operations:

import pandas as pd
import datatable as dt

# Read with datatable
datatable_df = dt.fread('myfile.csv')

# Then convert the frame into a pandas DataFrame
pandas_df = datatable_df.to_pandas()

3 Comments

Thanks for the answer. But how slow is the later conversion from datatable to pandas? Because if it takes too long, the whole process may not be worth it.
@TonyBalboa actually not much longer than the reading operation. It will depend on the machine you are running it on, but the entire operation of reading + converting should take just a few seconds.
Thank you, it is indeed much faster.
