
I have around 50 million rows of data with 4 million unique IDs. For each of those IDs, I want to calculate the net amount paid per row (the "paid" column holds the cumulative amount). Original dataframe: [image: Original dataframe]

This is the result I want: [image: Resultant dataframe]

Doing this with the usual for-loop approach is very slow, and I'm looking to speed up the process. The main constraint is that the calculation must be done within each unique ID (to avoid mixing values across IDs), so I could not see how to use .apply() for this.

This is the code I have now:

import pandas as pd

df_super = pd.DataFrame()
for idx, df_sub in df.groupby("ID"):
    df_sub = df_sub.copy()  # avoid SettingWithCopyWarning on the group slice
    # Difference between the current and previous cumulative amount
    df_sub['net_paid_amount'] = df_sub['paid'] - df_sub['paid'].shift(1)
    # The first row of each group has no predecessor, so fill its NaN
    # with the original cumulative value
    df_sub['net_paid_amount'] = df_sub['net_paid_amount'].fillna(df_sub['paid'].iloc[0])
    df_super = df_super.append(df_sub)

Is there any method that can do this instead of using for-loops?

  • Hi, it looks like you don't need a for loop; you can simply do df['net_paid_amount'] = df.groupby("ID")['paid'].diff().fillna(df['paid']). If you don't want to add the column to df but to df_super, create a copy first and do the same on df_super. Note that the order of the result is not the same with both methods. Commented Nov 10, 2020 at 13:19
  • This is exactly what I was looking for. Thanks a lot Ben.T! Commented Nov 11, 2020 at 7:50

1 Answer


As Ben.T correctly pointed out in the comments, the solution is:

df['net_paid_amount'] = df.groupby("ID")['paid'].diff().fillna(df['paid'])

where .groupby("ID") ensures the operation runs separately within each unique ID, .diff() computes the difference between consecutive rows, and .fillna(df['paid']) fills the NaN produced in the first row of each group with the original cumulative value.
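To make the behavior concrete, here is a minimal runnable sketch of the one-liner on toy data (the IDs and amounts are illustrative, not from the original question):

```python
import pandas as pd

# Toy data: "paid" holds a cumulative total per ID
df = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2],
    "paid": [100, 150, 225, 50, 80],
})

# diff() within each ID gives the per-row increment; the first row of
# each group has no predecessor, so its NaN is filled with the
# original cumulative value
df["net_paid_amount"] = df.groupby("ID")["paid"].diff().fillna(df["paid"])

print(df)
# net_paid_amount: 100, 50, 75, 50, 30
```

Note that this keeps the original row order of df, unlike the groupby loop, which reassembles the frame group by group.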


2 Comments

And for a bit more explanation: what was super slow in your code is calling append on a DataFrame on each loop iteration. It would be more efficient to append to a list (or add an entry to a dictionary) inside the loop and then concat once outside it, see this
Okay, I didn't know that appending to a DataFrame was such a slow process, thanks! And I'm trying to accept my answer, but the website says I can only do that after 24 hours, so I'll do it as soon as the time is up.
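If the loop is kept, the list-then-concat pattern from the comment above can be sketched like this (toy data, illustrative values only):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2], "paid": [10, 25, 5]})

# Collect the per-group frames in a list and concatenate once at the
# end, instead of appending to a DataFrame inside the loop (which
# copies the accumulated frame on every iteration)
parts = []
for _, df_sub in df.groupby("ID"):
    df_sub = df_sub.copy()
    df_sub["net_paid_amount"] = df_sub["paid"].diff().fillna(df_sub["paid"])
    parts.append(df_sub)

df_super = pd.concat(parts)
print(df_super)
# net_paid_amount: 10, 15, 5
```

This turns the quadratic copy cost of repeated DataFrame.append into a single concatenation.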
