0

I am using tabula-py to read a PDF with multiple banking transactions. Converting the PDF to CSV yields:

enter image description here

But I need it to be:

enter image description here

Basically, the description is sometimes split.

So far this is what I have done:

import pandas as pd
import tabula

tabula.convert_into(
    'input.pdf', 'output.csv', output_format="csv", pages="all"
)
df = pd.read_csv("./output.csv")

for index, row in df.iterrows():
    datetime = row["datetime"]

    try:
        if pd.isnull(datetime) or (type(datetime) == list):
            prefix = str(df.loc[index]["description"])
            suffix = str(df.loc[index + 2]["description"])

            df.at[index + 1, "description"] = prefix + " " + suffix

            df.drop(index=index, inplace=True)
            df.drop(index=index + 2, inplace=True)
    except KeyError:
        pass

print(df)

So far I've gotten it to work with the solution I've provided. But the problem is this is very slow, and I have to do these tens of thousands of transactions per customer with thousands of customers to process which will take ages. I'm thinking maybe I could use df.apply() but I can't quite wrap my head around having access to rows after the current one.

Here's the CSV:

datetime,description,debit,credit,balance
2018-08-27 08:18 AM,Buy RAM,75.00,,5535.50
,Buy MSI RTX 3080,,,
2018-08-27 05:12 PM,,3000.00,,2539.25
,VENTUS 3X 10G OC graphics,,,
2018-08-28 03:46 PM,Salary,,5.00,2444.25
2018-08-28 03:46 PM,Buy Oranges,100.00,,2439.25
2018-08-28 08:00 PM,Project Compensation,,8550.00,10994.25
2018-08-28 08:04 PM,Buy Tesla Stock,2000.00,,8994.25
,Buy Logitech G502,,,
2018-08-29 10:47 AM,,5900.00,,3094.25
,HERO High Performance Gaming Mouse,,,
,Buy Extra Spicy,,,
2018-08-29 03:58 PM,,628.00,,2466.25
,Tacos,,,

P.S. I have no control over how the formatting of the PDF since the PDFs comes from the bank and they have no external API I could use.

2
  • Rather than trying to fix this error-prone solution, you could try a different one. I have good experiences with converting pdfs to html using poppler-util's pdftohtml, and subsequently loading it to a dataframe directly, or processing it with BeautifulSoup first. It's quite fast as well. Commented Jan 18, 2021 at 21:12
  • Wow, good Idea. I'll definitely check it out. Commented Jan 19, 2021 at 7:22

1 Answer 1

1

I know it's not exactly ideal, but here is one way of doing something close to what you need

>>> df['datetime'] = pd.to_datetime(df['datetime']).ffill()
>>> df.fillna('').groupby('datetime').agg(list)
                                                            description         debit    credit             balance
datetime
2018-08-27 08:18:00                        [Buy RAM, Buy MSI RTX 3080]      [75.0, ]      [, ]          [5535.5, ]
2018-08-27 17:12:00                      [, VENTUS 3X 10G OC graphics]    [3000.0, ]      [, ]         [2539.25, ]
2018-08-28 15:46:00                              [Salary, Buy Oranges]     [, 100.0]   [5.0, ]  [2444.25, 2439.25]
2018-08-28 20:00:00                             [Project Compensation]            []  [8550.0]          [10994.25]
2018-08-28 20:04:00               [Buy Tesla Stock, Buy Logitech G502]    [2000.0, ]      [, ]         [8994.25, ]
2018-08-29 10:47:00  [, HERO High Performance Gaming Mouse, Buy Ext...  [5900.0, , ]    [, , ]       [3094.25, , ]
2018-08-29 15:58:00                                          [, Tacos]     [628.0, ]      [, ]         [2466.25, ]

I guess you can "massage" it a bit further to achieve the exactly required result.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.