
I have read numerous threads on similar topics on the forum. However, I believe what I am asking here is not a duplicate question.

I am reading a very large CSV dataset (22 GB) with 350 million rows. I am trying to read the dataset in chunks, based on the solution provided in that link.

My current code is as follows.

import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    # Aggregate the purchase columns per (id, company) pair within this chunk
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()

chunk_size = 9000000
chunk_skip = 1

# First chunk: write the output file with the header
transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows = range(1, chunk_skip), nrows = chunk_size)
Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv')

# Remaining chunks: skip the rows already processed and append without the header
for i in range(0, 38):
    chunk_skip += chunk_size
    transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows = range(1, chunk_skip), nrows = chunk_size)
    Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv', mode = 'a', header = False)

There is no issue with the code itself; it runs fine. But groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum() only operates on 9000000 rows at a time, which is the declared chunk_size. I need that aggregation to run over the entire dataset, not chunk by chunk.

The reason is that when the aggregation runs chunk by chunk, only the rows within a single chunk get combined; rows belonging to the same (id, company) pair are scattered all over the dataset and end up in other chunks, so their sums stay separate.

A possible solution is to run the code again on the newly generated "Group_ID_Company.csv". That way, the code would go through the new dataset once more and sum() the required columns. However, I am thinking there may be another (better) way of achieving this.
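For reference, a minimal sketch of that two-pass idea (keeping the partial sums in memory instead of re-reading the intermediate CSV, using the same column and file names as my code above; the chunk size is only illustrative) would look something like this:

import pandas as pd

# First pass: reduce each chunk to partial sums per (id, company)
partial_sums = []
for chunk in pd.read_csv('transactions.csv', chunksize=9000000):
    partial_sums.append(
        chunk.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
    )

# Second pass: the partial results are small, so combining them fits in memory
result = pd.concat(partial_sums).reset_index().groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
result.to_csv('Group_ID_Company.csv')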

2 Answers


The answer from MarianD worked perfectly; I am posting an answer to share the solution code here.

Moreover, Dask is able to utilize all cores equally, whereas Pandas was maxing out only one core at 100%. That is another benefit of Dask over Pandas that I have noticed.

import dask.dataframe as dd

# Dask reads the CSV lazily and parallelizes the groupby across all cores;
# .compute() materializes the aggregated result as a pandas DataFrame
transactions_dataset_DF = dd.read_csv('transactions.csv')
Group_ID_Company_DF = transactions_dataset_DF.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum().compute()
Group_ID_Company_DF.to_csv('Group_ID_Company.csv')

# to clear the memory
transactions_dataset_DF = None
Group_ID_Company_DF = None

Dask was able to read all 350 million rows (20 GB of data) at once, which I could not achieve with Pandas earlier; there I had to create 37 chunks to process the entire dataset, and the processing took almost 2 hours to complete.

With Dask, however, it only took around 20 minutes to read, process, and save the new dataset in one go.
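As a side note, if the computed result itself were ever too large to hold in pandas, I believe Dask can also write the output directly without calling .compute(). Something along these lines should work on a reasonably recent Dask version (the blocksize and single_file parameters are my assumptions here, so check them against your installed version):

import dask.dataframe as dd

# Sketch: let Dask write the aggregated result itself instead of materializing
# it as a pandas DataFrame first (assumes to_csv supports single_file=True)
transactions = dd.read_csv('transactions.csv', blocksize='64MB')
grouped = transactions.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
grouped.to_csv('Group_ID_Company.csv', single_file=True)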


2 Comments

Nice of you to share your experience (and your code, too) here. Not everyone has the opportunity to work with such large datasets.
@MarianD, Thanks.

The solution to your problem is probably Dask. You may watch the introductory video, read the examples, and try them online in a live session (in JupyterLab).

2 Comments

So I tried your solution, and it worked. I was able to read all 350 million rows (20 GB of data) at once, which I could not achieve with Pandas previously. It took around 20 minutes to read, process, and then save the dataset using Dask.
Moreover, Pandas was using only one core (at 100%) of my two processors, but Dask was able to utilize both processors (and all cores) equally, so that's another benefit.
