I am using tabula-py to read a PDF with multiple banking transactions. Converting the PDF to CSV yields:
But I need it to be:
Basically, the description is sometimes split.
So far this is what I have done:
import pandas as pd
import tabula
tabula.convert_into(
'input.pdf', 'output.csv', output_format="csv", pages="all"
)
df = pd.read_csv("./output.csv")
for index, row in df.iterrows():
datetime = row["datetime"]
try:
if pd.isnull(datetime) or (type(datetime) == list):
prefix = str(df.loc[index]["description"])
suffix = str(df.loc[index + 2]["description"])
df.at[index + 1, "description"] = prefix + " " + suffix
df.drop(index=index, inplace=True)
df.drop(index=index + 2, inplace=True)
except KeyError:
pass
print(df)
So far I've gotten it to work with the solution I've provided. But the problem is this is very slow, and I have to do these tens of thousands of transactions per customer with thousands of customers to process which will take ages.
I'm thinking maybe I could use df.apply() but I can't quite wrap my head around having access to rows after the current one.
Here's the CSV:
datetime,description,debit,credit,balance
2018-08-27 08:18 AM,Buy RAM,75.00,,5535.50
,Buy MSI RTX 3080,,,
2018-08-27 05:12 PM,,3000.00,,2539.25
,VENTUS 3X 10G OC graphics,,,
2018-08-28 03:46 PM,Salary,,5.00,2444.25
2018-08-28 03:46 PM,Buy Oranges,100.00,,2439.25
2018-08-28 08:00 PM,Project Compensation,,8550.00,10994.25
2018-08-28 08:04 PM,Buy Tesla Stock,2000.00,,8994.25
,Buy Logitech G502,,,
2018-08-29 10:47 AM,,5900.00,,3094.25
,HERO High Performance Gaming Mouse,,,
,Buy Extra Spicy,,,
2018-08-29 03:58 PM,,628.00,,2466.25
,Tacos,,,
P.S. I have no control over how the formatting of the PDF since the PDFs comes from the bank and they have no external API I could use.

