Just when I think I am finally getting it, I feel like such a newb again.

I am trying to get a list of numbers from one column of a table in a PDF.

As a first step, I wanted to convert the table to a pandas DataFrame.

pip install tabula-py
pip install PyPDF2

import pandas as pd
import tabula
df = tabula.read_pdf('/content/Manifest.pdf')

However, the output I get is a list of length 1, not a DataFrame. When I inspect df, the information is there, but I have no idea how to access it because it is wrapped in that list.

So I am not sure why I didn't get a DataFrame, or what I am meant to do with a list of one item.

Not sure if it matters, but I am using Google Colab.

Any help would be awesome.

Thanks

  • Hey, since you're new, check out How to Ask. You shouldn't include pictures/images of code. Additionally, it's difficult to determine what df should contain if we don't have sample input (i.e. the PDF). Also, what exactly do you want in your output? Check out the tabula-py docs at tabula-py.readthedocs.io/en/latest/tabula.html, and look specifically at the return type of read_pdf(). Commented Jun 20, 2020 at 2:13
  • Thanks for the info. There's a lot to learn, and it looks like how to ask questions correctly is part of it. Cheers. Commented Jul 4, 2020 at 7:45

2 Answers


When called without any additional arguments, tabula.read_pdf returns a list of DataFrames rather than a single DataFrame. To get at a specific DataFrame, index into the list.

Here's an example where I read a document, select the first element of the list, and compare the two types:

import tabula

df = tabula.read_pdf(
    "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf")

df_0 = df[0]

print("type of df :", type(df))
print("type of df_0", type(df_0))

Returns:

type of df : <class 'list'>
type of df_0 <class 'pandas.core.frame.DataFrame'>
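Once you have the DataFrame out of the list, pulling one column out as a list of numbers could look roughly like the sketch below. This is only an illustration: the helper column_as_numbers and the column names "Item" and "Qty" are made up, and the hand-built tables list stands in for whatever tabula.read_pdf returns for your PDF.

```python
import pandas as pd


def column_as_numbers(tables, column, table_index=0):
    """Return the numeric values of `column` from one DataFrame in a list.

    `tables` is assumed to be a list of DataFrames, like the one
    tabula.read_pdf returns by default.
    """
    df = tables[table_index]
    # Text extracted from a PDF often arrives as strings; coerce anything
    # that is not a number to NaN, then drop those entries.
    values = pd.to_numeric(df[column], errors="coerce")
    return values.dropna().tolist()


# Hypothetical stand-in for the result of tabula.read_pdf(...)
tables = [pd.DataFrame({"Item": ["A", "B", "C"], "Qty": ["10", "n/a", "30"]})]

numbers = column_as_numbers(tables, "Qty")
print(numbers)
```

With your own file you would pass the list returned by tabula.read_pdf in place of tables and use whatever column name df.columns actually shows.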

1 Comment

I have a side question regarding tabula. It throws an error when I run tabula.read_pdf(), saying something about the "JVM cache being full and to increase its size". Any ideas on how to debug this issue?

Try something like df = tabula.read_pdf('/content/Manifest.pdf', sep=' ')
