I have around 50 million rows of data with 4 million unique IDs. For each of those IDs, I want to calculate the net amount paid per row (instead of the cumulative amount, which is stored in the "paid" column). Original dataframe looks like this: Original dataframe
This is the result I want: Resultant dataframe
Doing this with the usual for loop is very slow and I'm looking to speed it up. The main problem is that the difference has to be computed within each unique ID separately, so that the last row of one ID does not feed into the first row of the next. Hence, I could not use .apply() for this.
This is the code I have now:
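import pandas as pd
df_super = pd.DataFrame()
for idx, df_sub in df.groupby("ID"):
    df_sub = df_sub.copy()  # work on a copy to avoid SettingWithCopyWarning
    # Difference between the current and the previous cumulative amount
    df_sub['net_paid_amount'] = df_sub['paid'] - df_sub['paid'].shift(1)
    # The first row of each ID comes out as NaN; fill it with that ID's first "paid" value
    df_sub['net_paid_amount'] = df_sub['net_paid_amount'].fillna(df_sub['paid'].iloc[0])
    df_super = pd.concat([df_super, df_sub])  # DataFrame.append was removed in pandas 2.0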
Is there any method that can do this instead of using for-loops?
Without a for loop, you can simply do:
    df['net_paid_amount'] = df.groupby("ID")['paid'].diff().fillna(df['paid'])
If you don't want to add the column to df but to df_super, create a copy first (df_super = df.copy()) and do the same with df_super. Note that the row order of the result is not the same with both methods: the loop returns the rows grouped by ID, while this one-liner keeps the original row order of df.
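To check that the vectorized version matches the loop, here is a minimal sketch on a small, made-up dataframe (the sample values are hypothetical and only assume that "paid" is cumulative within each ID). It computes the column both ways and compares them after aligning the row order.

```python
import pandas as pd

# Hypothetical sample data: "paid" is cumulative within each ID.
df = pd.DataFrame({
    "ID":   [1, 2, 1, 2, 1],
    "paid": [10, 5, 25, 5, 40],
})

# Vectorized: per-ID difference. The first row of each ID is NaN and
# falls back to its own "paid" value (fillna aligns on the index).
df["net_paid_amount"] = df.groupby("ID")["paid"].diff().fillna(df["paid"])

# Loop-based reference, kept only for comparison.
pieces = []
for _, df_sub in df[["ID", "paid"]].groupby("ID"):
    df_sub = df_sub.copy()
    df_sub["net_paid_amount"] = df_sub["paid"].diff().fillna(df_sub["paid"].iloc[0])
    pieces.append(df_sub)
df_super = pd.concat(pieces)

# The loop result is ordered by ID, so restore the original row order
# before comparing the two columns.
same = df_super.sort_index()["net_paid_amount"].equals(df["net_paid_amount"])
print(df)
print("identical after sorting:", same)
```

Collecting the per-ID pieces in a list and calling pd.concat once avoids growing a dataframe inside the loop, but on 50 million rows the groupby-diff one-liner should still be the faster option by a wide margin.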