
I have around 50 million rows of data with 4 million unique IDs. For each of those IDs, I want to calculate the net amount paid per row (the "paid" column holds the cumulative amount). Original dataframe: [image: Original dataframe]

This is the result I want: [image: Resultant dataframe]

Doing this with the usual for-loop approach is very slow, and I'm looking to speed up the process. The main constraint is that the calculation must be done within each unique ID (to avoid mixing values across IDs), so I could not see how to use .apply() for this.

This is the code I have now:

import pandas as pd

df_super = pd.DataFrame()
for idx, df_sub in df.groupby("ID"):
    df_sub = df_sub.copy()  # avoid SettingWithCopyWarning on the group slice
    # Difference between the current and previous cumulative amount
    df_sub['net_paid_amount'] = df_sub['paid'] - df_sub['paid'].shift(1)
    # The first row of each group has no predecessor, so fill its NaN
    # with the original cumulative value
    df_sub['net_paid_amount'] = df_sub['net_paid_amount'].fillna(df_sub['paid'].iloc[0])
    df_super = df_super.append(df_sub)

Is there any method that can do this instead of using for-loops?

  • Hi, it looks like you don't need a for loop; you can simply do df['net_paid_amount'] = df.groupby("ID")['paid'].diff().fillna(df['paid']). If you don't want to add the column to df but to df_super, create a copy first and do the same on df_super. Note that the order of the result is not the same with both methods. Commented Nov 10, 2020 at 13:19
  • This is exactly what I was looking for. Thanks a lot Ben.T! Commented Nov 11, 2020 at 7:50

1 Answer


As Ben.T correctly pointed out in the comments, the solution is:

df['net_paid_amount'] = df.groupby("ID")['paid'].diff().fillna(df['paid'])

where .groupby("ID") ensures the operation runs separately within each unique ID, .diff() computes the difference between consecutive rows, and .fillna(df['paid']) fills the NaN produced in the first row of each group with the original cumulative value.
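To make the behavior concrete, here is a minimal runnable sketch of the one-liner on toy data (the IDs and amounts are illustrative, not from the original question):

```python
import pandas as pd

# Toy data: "paid" holds a cumulative total per ID
df = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2],
    "paid": [100, 150, 225, 50, 80],
})

# diff() within each ID gives the per-row increment; the first row of
# each group has no predecessor, so its NaN is filled with the
# original cumulative value
df["net_paid_amount"] = df.groupby("ID")["paid"].diff().fillna(df["paid"])

print(df)
# net_paid_amount: 100, 50, 75, 50, 30
```

Note that this keeps the original row order of df, unlike the groupby loop, which reassembles the frame group by group.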


2 Comments

And for a bit more explanation: what was super slow in your code is calling append on a DataFrame on each loop iteration. It would be more efficient to append to a list (or add an entry to a dictionary) inside the loop and then concat once outside it, see this
Okay, I didn't know that appending to a DataFrame was such a slow process, thanks! And I'm trying to accept my answer, but the website says I can only do that after 24 hours, so I'll do it as soon as the time is up.
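If the loop is kept, the list-then-concat pattern from the comment above can be sketched like this (toy data, illustrative values only):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2], "paid": [10, 25, 5]})

# Collect the per-group frames in a list and concatenate once at the
# end, instead of appending to a DataFrame inside the loop (which
# copies the accumulated frame on every iteration)
parts = []
for _, df_sub in df.groupby("ID"):
    df_sub = df_sub.copy()
    df_sub["net_paid_amount"] = df_sub["paid"].diff().fillna(df_sub["paid"])
    parts.append(df_sub)

df_super = pd.concat(parts)
print(df_super)
# net_paid_amount: 10, 15, 5
```

This turns the quadratic copy cost of repeated DataFrame.append into a single concatenation.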
