Merge Python Dataframe rows that are split in multiple rows

Question

I am using tabula-py to read a PDF with multiple banking transactions. Converting the PDF to CSV yields:

But I need it to be:

Basically, the description is sometimes split.

So far this is what I have done:

import pandas as pd
import tabula

tabula.convert_into(
    'input.pdf', 'output.csv', output_format="csv", pages="all"
)
df = pd.read_csv("./output.csv")

for index, row in df.iterrows():
    datetime = row["datetime"]

    try:
        if pd.isnull(datetime) or (type(datetime) == list):
            prefix = str(df.loc[index]["description"])
            suffix = str(df.loc[index + 2]["description"])

            df.at[index + 1, "description"] = prefix + " " + suffix

            df.drop(index=index, inplace=True)
            df.drop(index=index + 2, inplace=True)
    except KeyError:
        pass

print(df)

So far I've gotten it to work with the solution I've provided. But the problem is this is very slow, and I have to do these tens of thousands of transactions per customer with thousands of customers to process which will take ages. I'm thinking maybe I could use df.apply() but I can't quite wrap my head around having access to rows after the current one.

Here's the CSV:

datetime,description,debit,credit,balance
2018-08-27 08:18 AM,Buy RAM,75.00,,5535.50
,Buy MSI RTX 3080,,,
2018-08-27 05:12 PM,,3000.00,,2539.25
,VENTUS 3X 10G OC graphics,,,
2018-08-28 03:46 PM,Salary,,5.00,2444.25
2018-08-28 03:46 PM,Buy Oranges,100.00,,2439.25
2018-08-28 08:00 PM,Project Compensation,,8550.00,10994.25
2018-08-28 08:04 PM,Buy Tesla Stock,2000.00,,8994.25
,Buy Logitech G502,,,
2018-08-29 10:47 AM,,5900.00,,3094.25
,HERO High Performance Gaming Mouse,,,
,Buy Extra Spicy,,,
2018-08-29 03:58 PM,,628.00,,2466.25
,Tacos,,,

P.S. I have no control over how the formatting of the PDF since the PDFs comes from the bank and they have no external API I could use.

Rather than trying to fix this error-prone solution, you could try a different one. I have good experiences with converting pdfs to html using poppler-util's pdftohtml, and subsequently loading it to a dataframe directly, or processing it with BeautifulSoup first. It's quite fast as well. — RJ Adriaansen
– RJ Adriaansen, Commented Jan 18, 2021 at 21:12

Enthusiast · Accepted Answer · 2021-01-19 09:19:48Z

I know it's not exactly ideal, but here is one way of doing something close to what you need

>>> df['datetime'] = pd.to_datetime(df['datetime']).ffill()
>>> df.fillna('').groupby('datetime').agg(list)
                                                            description         debit    credit             balance
datetime
2018-08-27 08:18:00                        [Buy RAM, Buy MSI RTX 3080]      [75.0, ]      [, ]          [5535.5, ]
2018-08-27 17:12:00                      [, VENTUS 3X 10G OC graphics]    [3000.0, ]      [, ]         [2539.25, ]
2018-08-28 15:46:00                              [Salary, Buy Oranges]     [, 100.0]   [5.0, ]  [2444.25, 2439.25]
2018-08-28 20:00:00                             [Project Compensation]            []  [8550.0]          [10994.25]
2018-08-28 20:04:00               [Buy Tesla Stock, Buy Logitech G502]    [2000.0, ]      [, ]         [8994.25, ]
2018-08-29 10:47:00  [, HERO High Performance Gaming Mouse, Buy Ext...  [5900.0, , ]    [, , ]       [3094.25, , ]
2018-08-29 15:58:00                                          [, Tacos]     [628.0, ]      [, ]         [2466.25, ]

I guess you can "massage" it a bit further to achieve the exactly required result.

Collectives™ on Stack Overflow

Merge Python Dataframe rows that are split in multiple rows

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related