
I have read numerous threads on similar topics on the forum. However, I believe what I am asking here is not a duplicate question.

I am reading a very large CSV dataset (22 GB) with 350 million rows. I am trying to read the dataset in chunks, based on the solution provided in that link.

My current code is as follows.

import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    # Aggregate the purchase columns per (id, company) pair within this chunk
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()

chunk_size = 9000000
chunk_skip = 1

# First chunk: write the output file with the header
transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows = range(1, chunk_skip), nrows = chunk_size)
Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv')

# Remaining chunks: skip the rows already processed and append without the header
for i in range(0, 38):
    chunk_skip += chunk_size
    transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows = range(1, chunk_skip), nrows = chunk_size)
    Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv', mode = 'a', header = False)

There is no issue with the code itself; it runs fine. But groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum() only operates on 9000000 rows at a time, which is the declared chunk_size. I need that aggregation to run over the entire dataset, not chunk by chunk.

The reason is that when the aggregation runs chunk by chunk, only the rows within a single chunk get combined; rows belonging to the same (id, company) pair are scattered all over the dataset and end up in other chunks, so their sums stay separate.

A possible solution is to run the code again on the newly generated "Group_ID_Company.csv". That way, the code would go through the new dataset once more and sum() the required columns. However, I am thinking there may be another (better) way of achieving this.
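For reference, a minimal sketch of that two-pass idea (keeping the partial sums in memory instead of re-reading the intermediate CSV, using the same column and file names as my code above; the chunk size is only illustrative) would look something like this:

import pandas as pd

# First pass: reduce each chunk to partial sums per (id, company)
partial_sums = []
for chunk in pd.read_csv('transactions.csv', chunksize=9000000):
    partial_sums.append(
        chunk.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
    )

# Second pass: the partial results are small, so combining them fits in memory
result = pd.concat(partial_sums).reset_index().groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
result.to_csv('Group_ID_Company.csv')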

2 Answers


The answer from MarianD worked perfectly; I am posting an answer to share the solution code here.

Moreover, Dask is able to utilize all cores equally, whereas Pandas was maxing out only one core at 100%. That is another benefit of Dask over Pandas that I have noticed.

import dask.dataframe as dd

# Dask reads the CSV lazily and parallelizes the groupby across all cores;
# .compute() materializes the aggregated result as a pandas DataFrame
transactions_dataset_DF = dd.read_csv('transactions.csv')
Group_ID_Company_DF = transactions_dataset_DF.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum().compute()
Group_ID_Company_DF.to_csv('Group_ID_Company.csv')

# to clear the memory
transactions_dataset_DF = None
Group_ID_Company_DF = None

Dask was able to read all 350 million rows (20 GB of data) at once, which I could not achieve with Pandas earlier; there I had to create 37 chunks to process the entire dataset, and the processing took almost 2 hours to complete.

With Dask, however, it only took around 20 minutes to read, process, and save the new dataset in one go.
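As a side note, if the computed result itself were ever too large to hold in pandas, I believe Dask can also write the output directly without calling .compute(). Something along these lines should work on a reasonably recent Dask version (the blocksize and single_file parameters are my assumptions here, so check them against your installed version):

import dask.dataframe as dd

# Sketch: let Dask write the aggregated result itself instead of materializing
# it as a pandas DataFrame first (assumes to_csv supports single_file=True)
transactions = dd.read_csv('transactions.csv', blocksize='64MB')
grouped = transactions.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
grouped.to_csv('Group_ID_Company.csv', single_file=True)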


2 Comments

Nice of you to share your experience (and your code, too) here. Not everyone has the opportunity to work with such large datasets.
@MarianD, Thanks.

The solution to your problem is probably Dask. You may watch the introductory video, read the examples, and try them online in a live session (in JupyterLab).

2 Comments

So I tried your solution, and it worked. I was able to read all 350 million rows (20 GB of data) at once, which I could not achieve with Pandas previously. It took around 20 minutes to read, process, and then save the dataset using Dask.
Moreover, Pandas was using only one core (at 100%) of my two processors, but Dask was able to utilize both processors (and all cores) equally, so that's another benefit.
