Background:

I have a multi-page PDF file (LARGE_PDF). Every page contains exactly one table and no other content, and every table looks different. I want to extract the table contents into a pandas DataFrame. I am using tabula-py for that, and it works as desired with the approach below:

Approach:

First, I split the PDF file into multiple single-page PDF files and save them to disk.

single_page_files = split_and_save(LARGE_PDF) # Split to single files, one page each

Second, I feed every single-page file to tabula-py.

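from tabula import read_pdf as tabular_read

for item in single_page_files:
    print(type(item))
    # Pass the current file path to tabula-py
    df = tabular_read(item, pandas_options={'header': None})
    # Truth-testing a DataFrame raises ValueError, so compare with None
    if df is not None:
        print('approach works')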

Output:

>>> <type 'str'>                              # filepath string
>>> approach works

Challenge:

I now want to do this in memory, so that no intermediate single-page PDF files are saved to disk. To do this, I create a list of single-page PyPDF2.pdf.PageObject objects and feed them to tabula-py.

from PyPDF2 import PdfFileReader, PdfFileWriter

single_page_pypdfobjects = split_but_dont_save(LARGE_PDF)
for item in single_page_pypdfobjects:
    print(type(item))
    # Pass the current PageObject (tabular_read is imported as above)
    df = tabular_read(item, pandas_options={'header': None})
    if df is not None:
        print('approach works')

Output:

>>> <class 'PyPDF2.pdf.PageObject'>            # PyPDF2 single page object
>>> TypeError: unhashable type

How can I process PDFs in memory using Python?

just take a look here. Commented Feb 22, 2019 at 6:16

1 Answer

You don't need to split up the PDF. tabula-py has a pages option that tells it which pages you want to extract from.
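As a rough sketch of that option, the helper below asks tabula-py for one page at a time from the original file, so nothing has to be split beforehand. The function name read_tables_per_page and its reader parameter are hypothetical, added only so the loop can be tried without tabula-py or Java installed; pages and pandas_options are keyword arguments of read_pdf. Depending on the tabula-py version, each call returns either a DataFrame or a list of DataFrames.

```python
def read_tables_per_page(pdf_path, page_numbers, reader=None):
    """Return a dict mapping each page number to what read_pdf gave back.

    `reader` is a hypothetical hook for swapping in a stand-in function;
    when it is left as None, tabula-py's read_pdf is used.
    """
    if reader is None:
        # Imported lazily so the helper can be loaded without tabula-py.
        from tabula import read_pdf as reader

    tables = {}
    for page in page_numbers:
        # `pages` is 1-based; header=None keeps the first row as data.
        tables[page] = reader(
            pdf_path,
            pages=page,
            pandas_options={"header": None},
        )
    return tables
```

Usage would look something like read_tables_per_page(LARGE_PDF, range(1, n_pages + 1)), where n_pages is assumed to be known or obtained separately.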

4 Comments

Thanks for your comment. I am aware of that option. For various reasons, I need to split. I can't change this approach.
Then you must use temporary files, as tabula-py doesn't support taking anything else as input. Well, it does support file-like objects, but internally it handles them by writing them to a file.
I am surprised. I did not find anything in the docs about that. What about StringIO objects or something similar?
read_pdf calls localize_file, which handles file-like objects and URLs by fetching them and copying them to a file.
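
If splitting really is required, one way to follow the temporary-file advice is sketched below. It is only an illustration under some assumptions: iter_single_page_pdfs and read_pdf_bytes are hypothetical names, the reader parameter is a hypothetical hook for testing, and the PyPDF2 calls use the legacy PdfFileReader/PdfFileWriter names from the question (newer PyPDF2 releases renamed them to PdfReader/PdfWriter). Each page is built in memory, written to a short-lived temporary file just long enough for read_pdf to parse it, and then removed.

```python
import io
import os
import tempfile


def iter_single_page_pdfs(pdf_path):
    """Yield each page of pdf_path as the bytes of a one-page PDF."""
    # Legacy PyPDF2 names, as used in the question (assumes PyPDF2 < 3.0).
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(pdf_path, "rb") as handle:
        source = PdfFileReader(handle)
        for index in range(source.getNumPages()):
            writer = PdfFileWriter()
            writer.addPage(source.getPage(index))
            buffer = io.BytesIO()
            writer.write(buffer)  # the one-page PDF stays in memory here
            yield buffer.getvalue()


def read_pdf_bytes(pdf_bytes, reader=None, **reader_kwargs):
    """Write pdf_bytes to a temporary file, parse it, then delete the file."""
    if reader is None:
        # Imported lazily so the helper can be loaded without tabula-py.
        from tabula import read_pdf as reader

    fd, temp_path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as temp_file:
            temp_file.write(pdf_bytes)
        # The file is closed before the reader opens it by path.
        return reader(temp_path, **reader_kwargs)
    finally:
        os.remove(temp_path)
```

A loop such as "for page_bytes in iter_single_page_pdfs(LARGE_PDF): df = read_pdf_bytes(page_bytes, pandas_options={'header': None})" would then keep only one temporary file on disk at any moment.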
