0

I am working with this dataframe where it has to be grouped by DocumentId and PersonId. Within that group, if the End Date column is empty then fill it with the row where DocCode is RT.enter image description here

DocumentID,PersonID,DocCode,StartDate,Amount,EndDate
120303,110001,FB,5/18/21,245,
120303,110001,TW,5/25/21,460,
120303,110001,RT,6/1/21,,6/6/21
120303,110011,GK,4/1/21,0,
120303,110011,AK,4/8/21,128,
120303,110011,PL,4/12/21,128,
120303,110011,FB,4/16/21,256,
120303,110011,RT,4/28/21,,5/4/21

It works fine but there is another twist to it. Within that group of DocumentId & PersonID if the amount changes, the next amount's StartDate will be the previous amount's EndDate. So intermediate dataframe will look like this:

enter image description here

And then all the rows with duplicate amounts within that group and empty amounts will be collapsed into 1 row.

Final dataset will look like this: enter image description here

Here is the code I am using to fill up all the empty EndDate columns from the row where DocCode is RT:

df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)

def fill_end_date(df):
    rt_doc = df[df["DocumentCode"] == "RT"]
    # if there is row in this group by with DocumentCode RT
    if not rt_doc.empty:
        end_date = rt_doc.iloc[0]["EndDate"]
        # and EndDate not empty
        if pd.notnull(end_date):
            # all other rows need to be filled by that end date
            df = df.fillna({"EndDate": end_date})

    return df
4
  • Before downvoting, atleast ask a clarifying question. Commented Jun 9, 2021 at 18:22
  • 1
    didn't downvote but you shared images not text and people cannot copy the data and test their solutions (unless they construct the dataframe line-by-line themselves, which is highly unlikely), and pd.read_image doesn't exist yet Commented Jun 9, 2021 at 18:25
  • Make sense, I can replace excel screenshots with actual data. It was intuitive to show the intermediate result by highlighting the column where the values are coming from. Commented Jun 9, 2021 at 18:28
  • 1
    @MustafaAydın Does it help? Commented Jun 9, 2021 at 18:33

1 Answer 1

1

First forward-fill your Amount column values to fill in all blanks. This is done to find when Amount changes. Shift the StartDate values back by 1, which will be used to fill into EndDate column wherever Amount change is detected:

df.Amount.ffill(inplace=True)
df['StartDateShift'] = df['StartDate'].shift(-1)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount EndDate StartDateShift
0      120303    110001      FB   5/18/21   245.0     NaN        5/25/21
1      120303    110001      TW   5/25/21   460.0     NaN         6/1/21
2      120303    110001      RT    6/1/21   460.0  6/6/21         4/1/21
3      120303    110011      GK    4/1/21     0.0     NaN         4/8/21
4      120303    110011      AK    4/8/21   128.0     NaN        4/12/21
5      120303    110011      PL   4/12/21   128.0     NaN        4/16/21
6      120303    110011      FB   4/16/21   256.0     NaN        4/28/21
7      120303    110011      RT   4/28/21   256.0  5/4/21            NaN

Now get all those rows where Amount changes, and create a mask from it. Use this mask to fill into NaNs in EndDate with values from StartDateShift:

mask = (df['Amount'] != df['Amount'].shift(-1))
df['EndDate'].fillna(df[mask]['StartDateShift'], inplace=True)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate StartDateShift
0      120303    110001      FB   5/18/21   245.0  5/25/21        5/25/21
1      120303    110001      TW   5/25/21   460.0      NaN         6/1/21
2      120303    110001      RT    6/1/21   460.0   6/6/21         4/1/21
3      120303    110011      GK    4/1/21     0.0   4/8/21         4/8/21
4      120303    110011      AK    4/8/21   128.0      NaN        4/12/21
5      120303    110011      PL   4/12/21   128.0  4/16/21        4/16/21
6      120303    110011      FB   4/16/21   256.0      NaN        4/28/21
7      120303    110011      RT   4/28/21   256.0   5/4/21            NaN

Back-fill the EndDate values, and then drop duplicates in Amount column (collapsing empty ones into one row). Also now drop the StartDateShift column, and you will get your final dataframe:

df['EndDate'].bfill(inplace=True)
df = df.drop_duplicates('Amount', keep='first').drop('StartDateShift', axis=1)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate
0      120303    110001      FB   5/18/21   245.0  5/25/21
1      120303    110001      TW   5/25/21   460.0   6/6/21
3      120303    110011      GK    4/1/21     0.0   4/8/21
4      120303    110011      AK    4/8/21   128.0  4/16/21
6      120303    110011      FB   4/16/21   256.0   5/4/21

In above steps, I didn't groupby DocumentId and PersonId columns as the logic worked on your sample df without it. But for your actual use-case, you can apply groupby on DocumentId and PersonId columns and perform all steps above for each group, then concat in the end:

dfs = []
for i,dfg in df.groupby(['DocumentID','PersonID']):
    dfg.Amount.ffill(inplace=True)
    dfg['StartDateShift'] = dfg['StartDate'].shift(-1)
    dfg['EndDate'].fillna(dfg[~(dfg['Amount']==dfg['Amount'].shift(-1))]['StartDateShift'], inplace=True)
    dfg['EndDate'].bfill(inplace=True)
    dfg = dfg.drop_duplicates('Amount', keep='first').drop('StartDateShift', axis=1)
    dfs.append(dfg)

final_df = pd.concat(dfs)

>>> final_df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate
0      120303    110001      FB   5/18/21   245.0  5/25/21
1      120303    110001      TW   5/25/21   460.0   6/6/21
3      120303    110011      GK    4/1/21     0.0   4/8/21
4      120303    110011      AK    4/8/21   128.0  4/16/21
6      120303    110011      FB   4/16/21   256.0   5/4/21
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you @Ank Really appreciate you helping out!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.