Background:
I have a PDF file with multiple pages (LARGE_PDF). Every page contains one table and no other content. Every table looks different. I want to extract the table contents and load them into a pandas DataFrame. I am using tabula-py for that, and it works as desired with the approach below:
Approach:
First, I split the PDF file into multiple single-page PDF files and save them to disk.
single_page_files = split_and_save(LARGE_PDF) # Split to single files, one page each
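The body of split_and_save is not shown here. A minimal sketch of what such a helper might look like, assuming the legacy (pre-3.0) PyPDF2 API and a hypothetical page_filename naming helper and "pages" output directory that are not part of the original code:

```python
import os


def page_filename(out_dir, page_number):
    """Build the path for one single-page PDF, e.g. pages/page_001.pdf."""
    return os.path.join(out_dir, "page_%03d.pdf" % page_number)


def split_and_save(large_pdf, out_dir="pages"):
    """Write every page of large_pdf to its own file and return the paths."""
    # Imported here so the naming helper above works without PyPDF2 installed.
    # Assumption: legacy PyPDF2 names (PdfFileReader / PdfFileWriter).
    from PyPDF2 import PdfFileReader, PdfFileWriter

    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)

    paths = []
    with open(large_pdf, "rb") as source:
        reader = PdfFileReader(source)
        for index in range(reader.getNumPages()):
            # One writer per page, so each output file holds exactly one page.
            writer = PdfFileWriter()
            writer.addPage(reader.getPage(index))
            path = page_filename(out_dir, index + 1)
            with open(path, "wb") as target:
                writer.write(target)
            paths.append(path)
    return paths
```

The returned list contains plain file path strings, which is the input type the loop in the next step relies on.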
Second, feed every single file to tabula-py.
from tabula import read_pdf as tabular_read
for item in single_page_files:
    print(type(item))
    # item is the path of one single-page PDF file
    df = tabular_read(item, pandas_options={'header': None})
    if df is not None:
        print('approach works')
Output:
>>> <type 'str'> # filepath string
>>> approach works
Challenge:
I now want to do this in memory, so that no intermediate single-page PDF files are saved to disk. To do this, I create a list of single-page PyPDF2.pdf.PageObject objects and feed them to tabula-py.
from PyPDF2 import PdfFileReader, PdfFileWriter
single_page_pypdfobjects = split_but_dont_save(LARGE_PDF)
for item in single_page_pypdfobjects:
    print(type(item))
    # item is a PyPDF2 PageObject here, not a file path
    df = tabular_read(item, pandas_options={'header': None})
    if df is not None:
        print('approach works')
Output:
>>> <class 'PyPDF2.pdf.PageObject'> # PyPDF2 single page object
>>> TypeError: unhashable type
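For comparison, a variant of the in-memory split that produces one file-like buffer per page instead of a PageObject could be sketched as follows. This is only a sketch under assumptions: it uses the legacy (pre-3.0) PyPDF2 names, pages_to_buffers is a hypothetical helper, and whether read_pdf accepts a file-like object depends on the installed tabula-py version.

```python
import io


def pages_to_buffers(reader, make_writer):
    """Return one in-memory PDF buffer per page of `reader`.

    `reader` needs getNumPages() and getPage(i); `make_writer` is a callable
    that returns an object with addPage(page) and write(stream).
    """
    buffers = []
    for index in range(reader.getNumPages()):
        writer = make_writer()
        writer.addPage(reader.getPage(index))
        buf = io.BytesIO()
        writer.write(buf)
        buf.seek(0)  # rewind so the next consumer reads from the start
        buffers.append(buf)
    return buffers


def split_but_dont_save(large_pdf):
    """Split large_pdf into single-page PDFs held in memory."""
    # Assumption: legacy PyPDF2 names (PdfFileReader / PdfFileWriter).
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(large_pdf, "rb") as source:
        return pages_to_buffers(PdfFileReader(source), PdfFileWriter)
```

Each buffer could then be passed to tabular_read in place of a path, if the installed version supports file-like input. Since tabula-py hands the extraction to tabula-java in a separate process, it may still write a temporary file internally.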
How can I process PDFs in memory using Python?