How to process PDFs in-memory using python?

Question

Background:

I have a PDF file with multiple pages (LARGE_PDF). Every page contains exactly one table and no other content, and every table looks different. I want to extract the table contents into a pandas DataFrame. I am using tabula-py for that, and it works as desired with the approach below:

Approach:

First, I split the PDF file into multiple single-page PDF files and save them to disk.

single_page_files = split_and_save(LARGE_PDF) # Split to single files, one page each

Second, I feed each single-page file to tabula-py.

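from tabula import read_pdf as tabular_read

for item in single_page_files:
    print(type(item))
    # Pass the loop variable (a file path string) to tabula-py
    df = tabular_read(item, pandas_options={'header': None})
    if df is not None:  # a DataFrame has no unambiguous truth value
        print('approach works')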

Output:

>>> <type 'str'>                              # filepath string
>>> approach works

Challenge:

I now want to do this in memory, so that no intermediate single-page PDF files are saved to disk. To do this, I create a list of single-page PyPDF2.pdf.PageObject objects and feed them to tabula-py.

from PyPDF2 import PdfFileReader, PdfFileWriter

single_page_pypdfobjects = split_but_dont_save(LARGE_PDF)
for item in single_page_pypdfobjects:
    print(type(item))
    # Here the loop variable is a PageObject, not a file path
    df = tabular_read(item, pandas_options={'header': None})
    if df is not None:
        print('approach works')

Output:

>>> <class 'PyPDF2.pdf.PageObject'>            # PyPDF2 single page object
>>> TypeError: unhashable type

How to process PDFs in-memory using python?

just take a look here.

– Vikas Periyadath, Commented Feb 22, 2019 at 6:16

Dan D. · Accepted Answer · 2019-02-22 05:39:06Z

You don't need to split up the PDF. tabula-py has a pages option that tells it which pages to extract from.
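As a rough sketch of that suggestion, the loop below reads one page at a time from the original file through the pages argument of read_pdf. The helper name read_tables_per_page and its reader parameter are hypothetical, added only so the loop can be tried without tabula-py installed. Depending on the tabula-py version, each call may return a single DataFrame or a list of DataFrames.

```python
def read_tables_per_page(pdf_path, page_numbers, reader=None):
    """Read each requested page of one PDF separately, without splitting it.

    `reader` is a hypothetical hook for illustration; by default it is
    tabula-py's read_pdf, which needs tabula-py and a Java runtime.
    """
    if reader is None:
        from tabula import read_pdf as reader

    tables = {}
    for page in page_numbers:
        # `pages` selects the (1-based) page to extract from;
        # `pandas_options` is passed through as in the question.
        tables[page] = reader(pdf_path, pages=page,
                              pandas_options={'header': None})
    return tables
```

Usage would look like read_tables_per_page(LARGE_PDF, range(1, n_pages + 1)), where n_pages is assumed to be known, for example from PyPDF2.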

4 Comments

sudonym Over a year ago

Thanks for your comment. I am aware of that option. For various reasons, I need to split the file. I can't change this approach.

Dan D. Over a year ago

Then you must use temporary files, as tabula-py doesn't support taking anything else as input. Well, it does support file-like objects, but internally it handles them by writing them to a file.

sudonym Over a year ago

I am surprised. I did not find anything in the docs about that. What about StringIO objects or something similar?

Dan D. Over a year ago

read_pdf calls localize_file, which handles file-like objects and URLs by fetching them and copying them to a file.
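The temporary-file route described in the comments above could be sketched roughly as follows. The helper names page_to_bytes and read_via_temp_file are hypothetical, and the PyPDF2 calls use the 1.x naming from the question (PdfFileWriter, addPage), which newer PyPDF2 releases have renamed. The reader argument is assumed to be tabula-py's read_pdf or any callable that takes a file path.

```python
import io
import os
import tempfile


def page_to_bytes(page):
    """Serialize one PyPDF2 PageObject into the bytes of a one-page PDF."""
    from PyPDF2 import PdfFileWriter  # PyPDF2 1.x naming, as in the question

    writer = PdfFileWriter()
    writer.addPage(page)
    buf = io.BytesIO()
    writer.write(buf)
    return buf.getvalue()


def read_via_temp_file(pdf_bytes, reader, **reader_kwargs):
    """Write the bytes to a short-lived file, hand its path to `reader`,
    and delete the file afterwards."""
    # delete=False so the file can be reopened by path on every platform
    tmp = tempfile.NamedTemporaryFile(suffix='.pdf', delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()
        return reader(tmp.name, **reader_kwargs)
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)
```

With these helpers, each page object from the question's loop could be processed as read_via_temp_file(page_to_bytes(item), tabular_read, pandas_options={'header': None}). A file still touches the disk briefly, but it is created and cleaned up per page rather than kept around.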
