Pandas: Replace column values in a group by based on value from the next row

Question

I am working with this dataframe where it has to be grouped by DocumentId and PersonId. Within that group, if the End Date column is empty then fill it with the row where DocCode is RT.

DocumentID,PersonID,DocCode,StartDate,Amount,EndDate
120303,110001,FB,5/18/21,245,
120303,110001,TW,5/25/21,460,
120303,110001,RT,6/1/21,,6/6/21
120303,110011,GK,4/1/21,0,
120303,110011,AK,4/8/21,128,
120303,110011,PL,4/12/21,128,
120303,110011,FB,4/16/21,256,
120303,110011,RT,4/28/21,,5/4/21

It works fine but there is another twist to it. Within that group of DocumentId & PersonID if the amount changes, the next amount's StartDate will be the previous amount's EndDate. So intermediate dataframe will look like this:

And then all the rows with duplicate amounts within that group and empty amounts will be collapsed into 1 row.

Final dataset will look like this:

Here is the code I am using to fill up all the empty EndDate columns from the row where DocCode is RT:

df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)

def fill_end_date(df):
    rt_doc = df[df["DocumentCode"] == "RT"]
    # if there is row in this group by with DocumentCode RT
    if not rt_doc.empty:
        end_date = rt_doc.iloc[0]["EndDate"]
        # and EndDate not empty
        if pd.notnull(end_date):
            # all other rows need to be filled by that end date
            df = df.fillna({"EndDate": end_date})

    return df

didn't downvote but you shared images not text and people cannot copy the data and test their solutions (unless they construct the dataframe line-by-line themselves, which is highly unlikely), and pd.read_image doesn't exist yet — Mustafa Aydın
– Mustafa Aydın, Commented Jun 9, 2021 at 18:25
Make sense, I can replace excel screenshots with actual data. It was intuitive to show the intermediate result by highlighting the column where the values are coming from. — Himanshu Yadav
– Himanshu Yadav, Commented Jun 9, 2021 at 18:28

Ank · Accepted Answer · 2021-06-09 19:30:50Z

First forward-fill your Amount column values to fill in all blanks. This is done to find when Amount changes. Shift the StartDate values back by 1, which will be used to fill into EndDate column wherever Amount change is detected:

df.Amount.ffill(inplace=True)
df['StartDateShift'] = df['StartDate'].shift(-1)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount EndDate StartDateShift
0      120303    110001      FB   5/18/21   245.0     NaN        5/25/21
1      120303    110001      TW   5/25/21   460.0     NaN         6/1/21
2      120303    110001      RT    6/1/21   460.0  6/6/21         4/1/21
3      120303    110011      GK    4/1/21     0.0     NaN         4/8/21
4      120303    110011      AK    4/8/21   128.0     NaN        4/12/21
5      120303    110011      PL   4/12/21   128.0     NaN        4/16/21
6      120303    110011      FB   4/16/21   256.0     NaN        4/28/21
7      120303    110011      RT   4/28/21   256.0  5/4/21            NaN

Now get all those rows where Amount changes, and create a mask from it. Use this mask to fill into NaNs in EndDate with values from StartDateShift:

mask = (df['Amount'] != df['Amount'].shift(-1))
df['EndDate'].fillna(df[mask]['StartDateShift'], inplace=True)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate StartDateShift
0      120303    110001      FB   5/18/21   245.0  5/25/21        5/25/21
1      120303    110001      TW   5/25/21   460.0      NaN         6/1/21
2      120303    110001      RT    6/1/21   460.0   6/6/21         4/1/21
3      120303    110011      GK    4/1/21     0.0   4/8/21         4/8/21
4      120303    110011      AK    4/8/21   128.0      NaN        4/12/21
5      120303    110011      PL   4/12/21   128.0  4/16/21        4/16/21
6      120303    110011      FB   4/16/21   256.0      NaN        4/28/21
7      120303    110011      RT   4/28/21   256.0   5/4/21            NaN

Back-fill the EndDate values, and then drop duplicates in Amount column (collapsing empty ones into one row). Also now drop the StartDateShift column, and you will get your final dataframe:

df['EndDate'].bfill(inplace=True)
df = df.drop_duplicates('Amount', keep='first').drop('StartDateShift', axis=1)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate
0      120303    110001      FB   5/18/21   245.0  5/25/21
1      120303    110001      TW   5/25/21   460.0   6/6/21
3      120303    110011      GK    4/1/21     0.0   4/8/21
4      120303    110011      AK    4/8/21   128.0  4/16/21
6      120303    110011      FB   4/16/21   256.0   5/4/21

In above steps, I didn't groupby DocumentId and PersonId columns as the logic worked on your sample df without it. But for your actual use-case, you can apply groupby on DocumentId and PersonId columns and perform all steps above for each group, then concat in the end:

dfs = []
for i,dfg in df.groupby(['DocumentID','PersonID']):
    dfg.Amount.ffill(inplace=True)
    dfg['StartDateShift'] = dfg['StartDate'].shift(-1)
    dfg['EndDate'].fillna(dfg[~(dfg['Amount']==dfg['Amount'].shift(-1))]['StartDateShift'], inplace=True)
    dfg['EndDate'].bfill(inplace=True)
    dfg = dfg.drop_duplicates('Amount', keep='first').drop('StartDateShift', axis=1)
    dfs.append(dfg)

final_df = pd.concat(dfs)

>>> final_df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate
0      120303    110001      FB   5/18/21   245.0  5/25/21
1      120303    110001      TW   5/25/21   460.0   6/6/21
3      120303    110011      GK    4/1/21     0.0   4/8/21
4      120303    110011      AK    4/8/21   128.0  4/16/21
6      120303    110011      FB   4/16/21   256.0   5/4/21

Collectives™ on Stack Overflow

Pandas: Replace column values in a group by based on value from the next row

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related