PDF to Pandas Data Frame

Question

Just when I think I am finally getting it, I get stuck again. Such a newb.

I am trying to get a list of numbers from one column of a table that is in a PDF.

As a first step, I wanted to convert the table to a pandas DataFrame.

pip install tabula-py
pip install PyPDF2

import pandas as pd
import tabula
df = tabula.read_pdf('/content/Manifest.pdf')

The output I get, however, is a list of length 1, not a DataFrame. When I look at df the info is there, I just have no idea how to access it, as it is a list of 1.

So I'm not sure why I didn't get a DataFrame, and I have no idea what I'm meant to do with a list of 1.

Not sure if it matters, but I am using Google Colab.

Any help would be awesome.

Thanks

Hey, since you're new, check out How to Ask. You shouldn't include pictures/images of code. Additionally, it's difficult to determine what df should contain if we don't have sample input (i.e. the PDF). Also, what exactly do you want in your output? Check out the docs for tabula-py at tabula-py.readthedocs.io/en/latest/tabula.html; specifically, look at the return type of the function read_pdf().
– eNc, Commented Jun 20, 2020 at 2:13
Thanks for the info. Lots to learn, and it looks like how to ask questions correctly is one of those things. Cheers
– 5 8, Commented Jul 4, 2020 at 7:45

Bhupinder Singh Narang · Accepted Answer · 2020-06-20 02:37:03Z

4

tabula.read_pdf returns a list of DataFrames by default, one per table it detects. To get a specific DataFrame, index into the list.

Here's an example where I read a document, select the first element of the list, and compare the types:

import tabula

df = tabula.read_pdf(
    "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf")

df_0 = df[0]

print("type of df :", type(df))
print("type of df_0", type(df_0))

Returns:

type of df : <class 'list'>
type of df_0 <class 'pandas.core.frame.DataFrame'>
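
Once you have the DataFrame out of the list, getting the numbers from one column is ordinary pandas. The following is a minimal sketch, not anything tabula-specific: the helper name column_as_list, the column name "Weight" and the sample values are all made up for illustration, and a hand-built DataFrame stands in for the list that tabula.read_pdf returns so the snippet runs without a PDF or Java.

```python
import pandas as pd


def column_as_list(tables, column, table_index=0):
    """Return one column of one extracted table as a plain Python list.

    `tables` is a list of DataFrames, shaped like what tabula.read_pdf returns.
    `column_as_list` is a hypothetical helper, not part of tabula-py.
    """
    if not tables:
        raise ValueError("no tables were extracted")
    # Pick the table, pick the column, drop empty cells, convert to a list.
    return tables[table_index][column].dropna().tolist()


# Stand-in for: tables = tabula.read_pdf('/content/Manifest.pdf')
# The column names and values below are invented sample data.
tables = [pd.DataFrame({"Item": ["a", "b", "c"], "Weight": [10, 20, 30]})]

weights = column_as_list(tables, "Weight")
print(weights)
```

With a real PDF you would pass the list from tabula.read_pdf instead, and check df[0].columns first to see what the column is actually called. Note that read_pdf only reads the first page unless you pass pages, for example pages='all'.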

edited Jun 20, 2020 at 2:37

answered Jun 20, 2020 at 2:28


1 Comment

Qas Over a year ago

I have a side question regarding tabula. It throws an error when I run tabula.read_pdf(): something about the "JVM cache being full and to increase its size". Any ideas how to debug this issue?

Mustafa ShazLy · Accepted Answer · 2020-10-03 13:30:26Z

-1

Try something like df = tabula.read_pdf('/content/Manifest.pdf', sep=' ')

answered Oct 3, 2020 at 13:30
