Extracting a table from a PDF resulted in the following dataframe, where each transaction's description is spread over several rows:

          Date      Transaction Details  Withdrawals  Deposits   Balance
0   01-01-2020  Tx1-Description - Line1       1625.0       NaN  97994.82
1          NaN                   Line 2          NaN       NaN       NaN
2   01-01-2020  Tx2-Description - Line1          NaN  84994.82  90000.00
3          NaN                   Line 2          NaN       NaN       NaN
4          NaN                   Line 3          NaN       NaN       NaN
5   02-01-2020  Tx3-Description - Line1         71.0       NaN  84923.82
6          NaN                   Line 2          NaN       NaN       NaN
7   02-01-2020  Tx4-Description - Line1          NaN     80.00  90000.00
8          NaN                   Line 2          NaN       NaN       NaN
9          NaN                   Line 3          NaN       NaN       NaN
10  03-01-2020  Tx5-Description - Line1        100.0       NaN  85000.00

How can I merge the Transaction Details column correctly, so that the continuation lines are joined onto the first row of each transaction?

Desired output:

         Date                    Transaction Details  Withdrawals  Deposits   Balance
0  01-01-2020         Tx1-Description - Line1 Line 2       1625.0       NaN  97994.82
1  01-01-2020  Tx2-Description - Line1 Line 2 Line 3          NaN  84994.82  90000.00
2  02-01-2020         Tx3-Description - Line1 Line 2         71.0       NaN  84923.82
3  02-01-2020  Tx4-Description - Line1 Line 2 Line 3          NaN     80.00  90000.00
4  03-01-2020                Tx5-Description - Line1        100.0       NaN  85000.00
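
For anyone who wants to reproduce this, the frame above can be rebuilt by hand. This is only a sketch: the values are copied from the table in the question, and the variable name df is an assumption rather than something the PDF extraction produced.

```python
import numpy as np
import pandas as pd

nan = np.nan  # shorthand for the empty cells left by the PDF extraction

# Values copied row by row from the table in the question.
df = pd.DataFrame({
    "Date": ["01-01-2020", nan, "01-01-2020", nan, nan,
             "02-01-2020", nan, "02-01-2020", nan, nan, "03-01-2020"],
    "Transaction Details": [
        "Tx1-Description - Line1", "Line 2",
        "Tx2-Description - Line1", "Line 2", "Line 3",
        "Tx3-Description - Line1", "Line 2",
        "Tx4-Description - Line1", "Line 2", "Line 3",
        "Tx5-Description - Line1",
    ],
    "Withdrawals": [1625.0, nan, nan, nan, nan, 71.0, nan, nan, nan, nan, 100.0],
    "Deposits": [nan, nan, 84994.82, nan, nan, nan, nan, 80.0, nan, nan, nan],
    "Balance": [97994.82, nan, 90000.0, nan, nan,
                84923.82, nan, 90000.0, nan, nan, 85000.0],
})
```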

2 Answers

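IIUC, you can use the non-null "Date" values to mark where each transaction starts (the cumulative sum of notna() gives every block of rows its own group number), then group on that and aggregate: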

(df.groupby(df['Date'].notna().cumsum(), as_index=False)
   .agg({'Date': 'first', 'Transaction Details': ' '.join,
         'Withdrawals': 'sum', 'Deposits': 'sum', 'Balance': 'sum'})
)

NB. The sums turn the all-NaN groups into 0; you can chain replace(0, float('nan')) if needed, though that would also blank any genuine 0 amounts.

output:

         Date                    Transaction Details  Withdrawals  Deposits   Balance
0  01-01-2020         Tx1-Description - Line1 Line 2       1625.0      0.00  97994.82
1  01-01-2020  Tx2-Description - Line1 Line 2 Line 3          0.0  84994.82  90000.00
2  02-01-2020         Tx3-Description - Line1 Line 2         71.0      0.00  84923.82
3  02-01-2020  Tx4-Description - Line1 Line 2 Line 3          0.0     80.00  90000.00
4  03-01-2020                Tx5-Description - Line1        100.0      0.00  85000.00
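
If the NaNs should survive as in the desired output, one possible variation is to aggregate the numeric columns with 'first' instead of 'sum', since 'first' returns the first non-null value of each group and stays NaN when a group has none. The sketch below is not the code from the answer: it assumes a trimmed copy of the question's data (the first two transactions only) and renames the group key to a made-up name, tx_id, so it cannot be confused with the "Date" column.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Trimmed sample: the first two transactions from the question (assumption).
df = pd.DataFrame({
    "Date": ["01-01-2020", nan, "01-01-2020", nan, nan],
    "Transaction Details": ["Tx1-Description - Line1", "Line 2",
                            "Tx2-Description - Line1", "Line 2", "Line 3"],
    "Withdrawals": [1625.0, nan, nan, nan, nan],
    "Deposits": [nan, nan, 84994.82, nan, nan],
    "Balance": [97994.82, nan, 90000.0, nan, nan],
})

# Each non-null Date starts a new transaction; the running count of those
# rows gives every row of a transaction the same group number (1, 1, 2, 2, 2).
key = df["Date"].notna().cumsum().rename("tx_id")  # "tx_id" is a hypothetical name

out = (df.groupby(key)
         .agg({"Date": "first",
               "Transaction Details": " ".join,
               # 'first' keeps NaN when the whole group is NaN, unlike 'sum'
               "Withdrawals": "first", "Deposits": "first", "Balance": "first"})
         .reset_index(drop=True))

print(out)
```

This relies on every amount sitting on the first row of its transaction, which holds for the data shown in the question.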

3 Comments

Thank you...What if I need to preserve the dates?
Awesome... I observed that chaining reset_index() did not remove the index column. I removed that and added df.reset_index() after. Thank you so much!
I forgot to remove reset_index from the previous code ;)

(df1.loc[:, :"Transaction Details"]
    .assign(col1=lambda dd: dd.Date.notna().cumsum(),  # running transaction number
            col2=lambda dd: dd.index)                   # original row label
    .groupby("col1")
    .agg(**{"Transaction Details": ("Transaction Details", " ".join),  # join with a space
            "col2": ("col2", "first")})
    .rename_axis(None)
    .join(df1.drop("Transaction Details", axis=1), on="col2"))

output:

         Date                    Transaction Details  Withdrawals  Deposits   Balance
0  01-01-2020         Tx1-Description - Line1 Line 2       1625.0      0.00  97994.82
1  01-01-2020  Tx2-Description - Line1 Line 2 Line 3          0.0  84994.82  90000.00
2  02-01-2020         Tx3-Description - Line1 Line 2         71.0      0.00  84923.82
3  02-01-2020  Tx4-Description - Line1 Line 2 Line 3          0.0     80.00  90000.00
4  03-01-2020                Tx5-Description - Line1        100.0      0.00  85000.00
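
The chain above is fairly dense, so here is a rough, unrolled sketch of the same idea: join the text per transaction, then attach it to the untouched first row of each transaction. It is not the answer's exact code. The names is_start, tx_id and details are made up for illustration, the data is a trimmed copy of the question's first two transactions, and it assumes the very first row starts a transaction.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Trimmed sample: the first two transactions from the question (assumption).
df1 = pd.DataFrame({
    "Date": ["01-01-2020", nan, "01-01-2020", nan, nan],
    "Transaction Details": ["Tx1-Description - Line1", "Line 2",
                            "Tx2-Description - Line1", "Line 2", "Line 3"],
    "Withdrawals": [1625.0, nan, nan, nan, nan],
    "Deposits": [nan, nan, 84994.82, nan, nan],
    "Balance": [97994.82, nan, 90000.0, nan, nan],
})

is_start = df1["Date"].notna()   # True on the first row of each transaction
tx_id = is_start.cumsum()        # same number for all rows of a transaction

# One joined description per transaction, in transaction order.
details = df1.groupby(tx_id)["Transaction Details"].agg(" ".join)

# Keep only the first row of each transaction and overwrite its description.
# Lengths match because the first row of df1 is assumed to start a transaction.
out = df1[is_start].copy()
out["Transaction Details"] = details.to_numpy()
out = out.reset_index(drop=True)

print(out)
```

Because the numeric columns are taken from the original rows rather than summed, the NaNs in this sketch stay NaN and the column order is unchanged.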
