I need to read data from hundreds of PDF forms. Every field on these forms is a text entry box, and the forms are not editable. I have been trying to use Python and PyPDF2 to read these forms into a CSV file (since the ultimate goal is an Excel database).
I have tried using Acrobat's export-as-CSV function, but this is extremely slow because each form has 4 embedded images that export as plaintext. I have the following code:
from PyPDF2 import PdfFileReader
infile = "FormSample.pdf"
pdf_reader = PdfFileReader(open(infile, "rb"))
with open('exportharvest.csv','w') as exportharvestcsv:
    dictionary = pdf_reader.getFields(fileobj=exportharvestcsv)
textfields = pdf_reader.getFormTextFields()
dest = pdf_reader.getNamedDestinations()
print(dest)
The issue with the above code is as follows: the getFields() call only returns the ~4 digital signature fields in the form (the form has ~300 entries). Is there some way to instruct Python to look through all the fields? I know the field names in the document, as they are listed when I export to pdf.
getFormTextFields() returns an empty dictionary, {}
getNamedDestinations() returns an empty dictionary, {}
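To make "look through all the fields" concrete, here is a minimal sketch of walking a form's field tree by hand and writing the result to CSV. It assumes the entries are ordinary AcroForm fields nested under /Kids, which has not been confirmed for these forms; if the data is stored some other way, this would come back just as empty. The helper names (resolve, flatten_fields, write_csv) and the sample field names are hypothetical, and the sample uses plain dicts in place of real PDF objects so that it runs without PyPDF2.

```python
import csv
import io


def resolve(obj):
    """Dereference a PyPDF2 indirect object; plain Python values pass through."""
    getter = getattr(obj, "getObject", None)
    return getter() if callable(getter) else obj


def flatten_fields(fields, prefix=""):
    """Walk a list of field dictionaries, descending into /Kids.

    Returns {fully.qualified.name: value}, joining partial names (/T) with dots.
    """
    values = {}
    for field in fields or []:
        field = resolve(field)
        partial = resolve(field.get("/T"))
        name = prefix
        if partial is not None:
            name = f"{prefix}.{partial}" if prefix else str(partial)
        kids = resolve(field.get("/Kids"))
        child_values = flatten_fields(kids, name) if kids else {}
        values.update(child_values)
        # Record this node if it is named and either holds a value itself
        # or has no named descendants (its kids, if any, are only widgets).
        if partial is not None and ("/V" in field or not child_values):
            values[name] = resolve(field.get("/V", ""))
    return values


def write_csv(values, out):
    """Write one row per field to an open text file or file-like object."""
    writer = csv.writer(out)
    writer.writerow(["field", "value"])
    for name, value in sorted(values.items()):
        writer.writerow([name, value])


# Hypothetical stand-in for the list found at /Root -> /AcroForm -> /Fields.
sample = [
    {"/T": "Page1", "/Kids": [
        {"/T": "Name", "/V": "Ada"},
        {"/T": "Date", "/Kids": [{"/Subtype": "/Widget"}]},
    ]},
    {"/T": "Sig1"},
]

buffer = io.StringIO()
write_csv(flatten_fields(sample), buffer)
print(buffer.getvalue())

# With the reader from the code above, the call might look like this
# (assumption: the document actually has an /AcroForm entry):
# root = pdf_reader.trailer["/Root"]
# fields = flatten_fields(root["/AcroForm"]["/Fields"])
```

Calling resolve() on every lookup is a defensive choice: dict.get() on a PyPDF2 dictionary may hand back an indirect reference rather than the object itself.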
Thanks for any help.