How to read data from a PDF form using Python

Question

I need to read data from hundreds of PDF forms. The forms consist entirely of text entry boxes and are not editable. I have been trying to use Python and PyPDF2 to read these forms into a CSV file (since the ultimate goal is an Excel database).

I have tried using Acrobat's export-as-CSV function, but this is extremely slow, as each form has 4 embedded images that are exported as plaintext. I have the following code:

from PyPDF2 import PdfFileReader


infile = "FormSample.pdf"

pdf_reader = PdfFileReader(open(infile, "rb"))


with open('exportharvest.csv','w') as exportharvestcsv:

    dictionary = pdf_reader.getFields(fileobj = exportharvestcsv)

textfields = pdf_reader.getFormTextFields()

dest = pdf_reader.getNamedDestinations()

print(dest)

The issue with the above code is as follows: getFields() only returns the ~4 digital signature fields in the form (the form has ~300 entries). Is there some way to instruct Python to look through all the fields? I know the field names in the document, as they are listed when I export to pdf.

getFormTextFields() returns an empty dictionary, {}.

getNamedDestinations() also returns an empty dictionary, {}.

Thanks for any help.

Could it be that the form fields have been made not editable by flattening the form? Flattening makes the form field appearances part of the regular page content stream and removes the abstract fields. That would explain your observations. Unfortunately, that would also make extracting the contents hard, as removing the abstract form fields removes the simple mapping of form field names to form field values from the PDF.
– mkl, Commented Jul 1, 2019 at 18:56
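
One rough way to test the flattening hypothesis is to count the fields that getFields() returns by their field type (the /FT entry, which the PDF specification defines as /Tx for text, /Btn for buttons, /Ch for choices and /Sig for signatures). The sketch below is only a heuristic, not a definitive check: the helper names summarize_field_types, looks_flattened and inspect_pdf are made up for illustration, and it assumes the legacy PyPDF2 API (PdfFileReader, getFields) used in the question.

```python
from collections import Counter


def summarize_field_types(fields):
    """Count the entries of a getFields()-style mapping by their /FT value."""
    if not fields:  # getFields() returns None when no form fields are found
        return Counter()
    return Counter(str(field.get('/FT', 'unknown')) for field in fields.values())


def looks_flattened(fields):
    """Heuristic: True if no text, button or choice fields remain."""
    counts = summarize_field_types(fields)
    return not any(counts[ft] for ft in ('/Tx', '/Btn', '/Ch'))


def inspect_pdf(path):
    """Print the field-type summary for one PDF (legacy PyPDF2 API assumed)."""
    from PyPDF2 import PdfFileReader  # imported lazily so the helpers stay standalone

    fields = PdfFileReader(path).getFields()
    print(summarize_field_types(fields))
    print('probably flattened:', looks_flattened(fields))
```

Calling inspect_pdf("FormSample.pdf") on a form that reports only /Sig entries would be consistent with the text fields having been flattened into the page content, although fields that inherit their type from a parent may show up as 'unknown'.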

Nivatius · Accepted Answer · 2020-08-07 12:04:30Z


In my experience, PyPDF2 is slow as well. This should do what you want:

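import csv
from pprint import pprint

from PyPDF2 import PdfFileReader

pdf_file_name = 'formdocument.pdf'

reader = PdfFileReader(pdf_file_name)
# getFields() returns None when the PDF has no interactive form fields
fields = reader.getFields() or {}
# Map each field name to its value ('/V'); fields without a value become ''
fdfinfo = {name: field.get('/V', '') for name, field in fields.items()}
pprint(fdfinfo)

# csv.writer handles quoting and escaping; str() covers non-string values
with open('test.csv', 'w', newline='') as f2:
    writer = csv.writer(f2, quoting=csv.QUOTE_ALL)
    for name, value in fdfinfo.items():
        writer.writerow([name, str(value)])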
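
Since the question mentions hundreds of forms and a spreadsheet as the end goal, the same idea could be extended to write one CSV row per PDF. This is a minimal sketch under some assumptions: the function names read_form_values and export_forms are hypothetical, the legacy PyPDF2 API (before 3.0) is assumed, and no form field is assumed to be named 'file'.

```python
import csv
from pathlib import Path


def read_form_values(pdf_path):
    """Return {field name: value} for one PDF, using the legacy PyPDF2 API."""
    from PyPDF2 import PdfFileReader  # imported lazily; PyPDF2 < 3.0 assumed

    fields = PdfFileReader(str(pdf_path)).getFields() or {}
    return {name: str(field.get('/V', '')) for name, field in fields.items()}


def export_forms(pdf_paths, csv_path, reader=read_form_values):
    """Write one CSV row per PDF, with one column per field name seen in any file."""
    rows = []
    for path in pdf_paths:
        row = {'file': Path(path).name}  # assumes no form field is named 'file'
        row.update(reader(path))
        rows.append(row)

    # Collect column names in first-seen order so the header is stable.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    with open(csv_path, 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=columns, restval='')
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

With the PDFs in a hypothetical folder named forms, a call might look like export_forms(sorted(Path('forms').glob('*.pdf')), 'exportharvest.csv'). The reader parameter is there so a different extraction function can be swapped in without touching the CSV logic.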

edited Aug 7, 2020 at 12:04

answered Jul 1, 2019 at 16:10
