How to extract PDF fields from a filled out form in Python?

Question

I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

The pdfminer demo: it didn't dump any of the filled out data.
pyPdf: it maxed a core for 2 minutes when I tried to load the file with PdfFileReader(f) and I just gave up and killed it.
Jython and PDFBox: got that working great but the startup time is excessive, I'll just write an external utility in straight Java if that's my only option.

I can keep hunting for libraries and trying them but I'm hoping someone already has an efficient solution for this.

Update: Based on Steven's answer I looked into pdfminer and it did the trick nicely.

from argparse import ArgumentParser
import pickle
import pprint
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1, PDFObjRef

def load_form(filename):
    """Load pdf form contents into a nested list of name/value tuples"""
    with open(filename, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        return [load_fields(resolve1(f)) for f in
                   resolve1(doc.catalog['AcroForm'])['Fields']]

def load_fields(field):
    """Recursively load form fields"""
    form = field.get('Kids', None)
    if form:
        return [load_fields(resolve1(f)) for f in form]
    else:
        # Some field types, like signatures, need extra resolving
        return (field.get('T').decode('utf-16'), resolve1(field.get('V')))

def parse_cli():
    """Load command line arguments"""
    parser = ArgumentParser(description='Dump the form contents of a PDF.')
    parser.add_argument('file', metavar='pdf_form',
                    help='PDF Form to dump the contents of')
    parser.add_argument('-o', '--out', help='Write output to file',
                      default=None, metavar='FILE')
    parser.add_argument('-p', '--pickle', action='store_true', default=False,
                      help='Format output for python consumption')
    return parser.parse_args()

def main():
    args = parse_cli()
    form = load_form(args.file)
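    if args.out:
        if args.pickle:
            # pickle writes bytes, so the file must be opened in binary mode
            with open(args.out, 'wb') as outfile:
                pickle.dump(form, outfile)
        else:
            with open(args.out, 'w') as outfile:
                pp = pprint.PrettyPrinter(indent=2)
                outfile.write(pp.pformat(form))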
    else:
        if args.pickle:
            print(pickle.dumps(form))
        else:
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(form)

if __name__ == '__main__':
    main()
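
The nested list that load_form returns mirrors the field hierarchy, which can be awkward to query by name. A minimal sketch of flattening it into a dict follows. flatten_form is a hypothetical helper, not part of pdfminer, and it assumes that groups are lists and leaves are (name, value) tuples, as produced by load_fields above. Parent names are not kept, so fields with the same name in different groups would overwrite each other.

```python
def flatten_form(form):
    """Flatten the nested lists returned by load_form into a {name: value} dict.

    Hypothetical helper: assumes groups are lists and leaves are
    (name, value) tuples, as produced by load_fields above.
    """
    flat = {}
    stack = [form]
    while stack:
        item = stack.pop()
        if isinstance(item, list):
            # A list is a group of kids; push them so they pop in original order
            stack.extend(reversed(item))
        else:
            name, value = item
            # A later field with the same name overwrites an earlier one
            flat[name] = value
    return flat


if __name__ == '__main__':
    sample = [('Name', b'Ada'), [('Street', b'1 Main St'), [('Zip', b'12345')]]]
    print(flatten_form(sample))
```

An explicit stack is used instead of recursion so that deeply nested forms cannot hit the recursion limit.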

As a note, I also tried using pdftk as an external utility and it didn't get past the owner password.
– Olson, Commented Oct 21, 2010 at 3:09

Sleep Deprived Bulbasaur · Accepted Answer · 2016-12-01 21:20:31Z

52

You should be able to do it with pdfminer, but it will require some delving into the internals of pdfminer and some knowledge about the pdf format (wrt forms of course, but also about pdf's internal structures like "dictionaries" and "indirect objects").

This example might help you on your way (I think it will work only on simple cases, with no nested fields etc...)

import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

filename = sys.argv[1]

# Keep the file open while the fields are read; objects are resolved from it
with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))

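EDIT: forgot to mention: if you need to provide a password, pass it to doc.initialize() in older pdfminer releases; in pdfminer.six, pass it as the password argument of the constructor instead, i.e. PDFDocument(parser, password=...).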

edited Dec 1, 2016 at 21:20

Sleep Deprived Bulbasaur

answered Oct 21, 2010 at 8:48

Steven

6 Comments

Olson Over a year ago

That did the trick, thank you. I saw the web demo and figured I could see if what I wanted was in there and if not I could skip it. Turns out not only can it do exactly what I want, it can even handle the signature fields that PdfBox can't.

Basil Over a year ago

I have an encoding problem. Using field.get('V') does not encode special characters like 'ü' or 'ä' properly. Does anyone have a solution to this? Converting the string to unicode raises a decoding error.

joshua Over a year ago

In the current version of pdfminer the PDFDocument.initialize method has been removed. This code works if you just remove that line.

Kim Ryan Over a year ago

This line causes an error:
from pdfminer.pdfdocument import PDFDocument
Should use:
from pdfminer.pdfparser import PDFParser, PDFDocument
Also get this error:
Traceback (most recent call last):
  File "so_2.py", line 12, in <module>
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
TypeError: 'NoneType' object is not subscriptable

Ciro Santilli OurBigBook.com Over a year ago

Works! Tested with this Latex input: tex.stackexchange.com/a/366238/19083
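
On the encoding problem raised in the comments above: field.get('V') hands back raw bytes, and PDF text strings are stored either as UTF-16BE with a leading byte-order mark or in PDFDocEncoding. A rough sketch of decoding both cases is below. decode_pdf_text is a hypothetical helper, and treating PDFDocEncoding as Latin-1 is an approximation that should hold for common accented letters such as 'ü' and 'ä' but not for every code point.

```python
import codecs


def decode_pdf_text(raw):
    """Decode a PDF text string given as bytes.

    Sketch only: Latin-1 stands in for PDFDocEncoding here, which is
    an assumption and not an exact mapping.
    """
    if not isinstance(raw, bytes):
        return raw  # already decoded, or not a string value at all
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Strip the byte-order mark, then decode the rest as UTF-16BE
        return raw[len(codecs.BOM_UTF16_BE):].decode('utf-16-be')
    return raw.decode('latin-1')


if __name__ == '__main__':
    print(decode_pdf_text(b'\xfe\xff\x00M\x00\xfc\x00l\x00l\x00e\x00r'))
    print(decode_pdf_text(b'M\xfcller'))
```

Values that are not bytes (numbers, names, nested objects) are passed through unchanged, so the helper can be applied to every value without checking its type first.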

Mark Z. · Accepted Answer · 2024-08-13 10:31:42Z

20

Python 3.6+:

pip install PyPDF2

from collections import OrderedDict

from PyPDF2 import PdfReader


def get_fields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                    '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._check_kids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._build_field(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.get_object()
            obj._build_field(field, retval, fileobj, fieldAttributes)

    return retval


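def get_form_fields(infile):
    # Keep the file open while reading: PyPDF2 may resolve objects lazily
    with open(infile, 'rb') as f:
        fields = get_fields(PdfReader(f))
        # get_fields returns None when the PDF has no AcroForm
        if fields is None:
            return OrderedDict()
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())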



if __name__ == '__main__':
    from pprint import pprint

    pdf_file_name = 'FormExample.pdf'

    pprint(get_form_fields(pdf_file_name))

edited Aug 13, 2024 at 10:31

Mark Z.

answered Apr 28, 2017 at 12:41

dvska

2 Comments

Raghav Over a year ago

Thanks for the detailed solution! However, for my form I keep getting an empty field list. Does it matter which tool was used to create the form? Mine were created using Adobe LiveCycle.

Jinhua Wang Over a year ago

This answer saved my day!

Wtower · Accepted Answer · 2023-01-21 18:20:54Z

17

The Python PyPDF2 package (successor to pyPdf) is very convenient:

import PyPDF2
f = PyPDF2.PdfReader('form.pdf')
ff = f.get_fields()

Then ff is a dict that contains all the relevant form information.
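
If only the filled-in values are needed, the dict can be reduced further. The sketch below is one way to do it, not the library's own API: field_values and read_form_values are hypothetical names, and it relies on each field object behaving like a dict with a '/V' entry, as in the longer PyPDF2 answer above.

```python
def field_values(fields):
    """Reduce a get_fields()-style mapping to {field name: value}.

    `fields` may be None, which get_fields() returns when the PDF has
    no form; fields without a '/V' entry map to None.
    """
    if not fields:
        return {}
    return {name: field.get('/V') for name, field in fields.items()}


def read_form_values(path):
    """Read a PDF form with PyPDF2 and return its field values."""
    # Imported lazily so field_values can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    with open(path, 'rb') as stream:
        # Stay inside the with block: objects may be read from the stream on demand
        return field_values(PdfReader(stream).get_fields())


if __name__ == '__main__':
    # Plain dicts stand in for PyPDF2 field objects in this demonstration
    demo = {'name': {'/FT': '/Tx', '/V': 'Ada'}, 'signed': {'/FT': '/Btn'}}
    print(field_values(demo))
```

Calling read_form_values('form.pdf') would then return something like a name-to-value dict for the form used in the snippet above.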

edited Jan 21, 2023 at 18:20

Wtower

answered Jan 11, 2018 at 16:07

equaeghe

2 Comments

Cam Over a year ago

This worked for me:
f = PyPDF2.PdfFileReader(the_path)
ff = f.getFields()

MaKaNu Over a year ago

PdfFileReader is Deprecated since version 3.0.0

Jason Sundram · Accepted Answer · 2012-01-31 06:50:01Z

Quick and dirty 2-minute job; just use PDFminer to convert PDF to xml and then grab all of the fields.

from xml.etree import ElementTree
from pprint import pprint
import os

def main():
    print("Calling PDFDUMP.py")
    os.system("dumppdf.py -a FILE.pdf > out.xml")

    # Preprocess the file to eliminate bad XML.
    print("Screening the file")
    with open("out.xml") as src, open("output.xml", "w") as dst:
        for line in src:
            # Some formatting info contains character references that break the parser.
            dst.write(line.replace("&#", "Invalid_XML"))

    print("Opening XML output")
    tree = ElementTree.parse('output.xml')
    lastnode = ""
    lastnode2 = ""
    fields = {}
    entry = {}

    for node in tree.iter():  # Run through the tree.
        # A <key>T</key> node means the next node holds the field name
        if node.tag == "key" and node.text == "T":
            lastnode = node.tag + node.text
        elif lastnode == "keyT":
            for child in node.iter():
                entry["ID"] = child.text
            lastnode = ""

        # A <key>V</key> node means the next node holds the field value
        if node.tag == "key" and node.text == "V":
            lastnode2 = node.tag + node.text
        elif lastnode2 == "keyV":
            for child in node.iter():
                if child.tag == "string":
                    if "ID" in entry:
                        entry["Value"] = child.text
                        fields[entry["ID"]] = entry["Value"]
                        entry = {}
            lastnode2 = ""

    pprint(fields)

if __name__ == '__main__':
    main()

It isn't pretty, just a simple proof of concept. I need to implement it for a system I'm working on so I will be cleaning it up, but I thought I would post it in case anyone finds it useful.
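
As a possible cleanup of the loop above, each key can be paired with the element that follows it instead of tracking the last node seen. This is only a sketch run on a hand-written sample: the key and string tags come from the code above, while the dict and value wrappers in SAMPLE are an assumption about what dumppdf.py emits and should be checked against real output. fields_from_xml is a hypothetical name.

```python
from xml.etree import ElementTree

# Hand-written sample; the <dict>/<value> wrappers are an assumption
# about dumppdf.py's output, not a verified format.
SAMPLE = """
<pdf>
  <dict size="2">
    <key>T</key><value><string size="4">name</string></value>
    <key>V</key><value><string size="3">Ada</string></value>
  </dict>
  <dict size="2">
    <key>T</key><value><string size="4">city</string></value>
    <key>V</key><value><string size="4">Oslo</string></value>
  </dict>
</pdf>
"""


def fields_from_xml(root):
    """Collect {T: V} pairs from every <dict> that has both keys."""
    result = {}
    for d in root.iter('dict'):
        children = list(d)
        # Keys and values alternate, so pair each <key> with its next sibling
        pairs = {k.text: v for k, v in zip(children, children[1:]) if k.tag == 'key'}
        name = pairs['T'].findtext('.//string') if 'T' in pairs else None
        value = pairs['V'].findtext('.//string') if 'V' in pairs else None
        # Skip dictionaries that lack either a name or a string value
        if name is not None and value is not None:
            result[name] = value
    return result


if __name__ == '__main__':
    print(fields_from_xml(ElementTree.fromstring(SAMPLE)))
```

Like the original, this only picks up values stored as strings, so checkboxes and signature fields would be skipped.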

vossman77 · Accepted Answer · 2015-12-09 20:28:04Z

3

Update for the latest version of pdfminer (this changes the imports and the parser/doc setup in the first function):

from argparse import ArgumentParser
import pickle
import pprint
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
from pdfminer.pdftypes import PDFObjRef

def load_form(filename):
    """Load pdf form contents into a nested list of name/value tuples"""
    with open(filename, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        parser.set_document(doc)
        #doc.set_parser(parser)
        doc.initialize()
        return [load_fields(resolve1(f)) for f in
            resolve1(doc.catalog['AcroForm'])['Fields']]

def load_fields(field):
    """Recursively load form fields"""
    form = field.get('Kids', None)
    if form:
        return [load_fields(resolve1(f)) for f in form]
    else:
        # Some field types, like signatures, need extra resolving
        return (field.get('T').decode('utf-8'), resolve1(field.get('V')))

def parse_cli():
    """Load command line arguments"""
    parser = ArgumentParser(description='Dump the form contents of a PDF.')
    parser.add_argument('file', metavar='pdf_form',
        help='PDF Form to dump the contents of')
    parser.add_argument('-o', '--out', help='Write output to file',
        default=None, metavar='FILE')
    parser.add_argument('-p', '--pickle', action='store_true', default=False,
        help='Format output for python consumption')
    return parser.parse_args()

def main():
    args = parse_cli()
    form = load_form(args.file)
    if args.out:
        with open(args.out, 'w') as outfile:
            if args.pickle:
                pickle.dump(form, outfile)
            else:
                pp = pprint.PrettyPrinter(indent=2)
                file.write(pp.pformat(form))
    else:
        if args.pickle:
            print pickle.dumps(form)
        else:
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(form)

if __name__ == '__main__':
    main()

answered Dec 9, 2015 at 20:28

vossman77

3 Comments

user2067030 Over a year ago

Where do you put the filename so the script can run ?

Raghav Over a year ago

As you can see, parse_cli picks up the filename from the command-line parameters; you can alter that function to pass in your filename.

Raghav Over a year ago

For my PDF file, I don't see any details being available to the parser. Does it matter what created the PDF file?

Tyler Houssian · Accepted Answer · 2021-03-25 21:58:17Z

3

I created a library to do this: pip install fillpdf

from fillpdf import fillpdfs
fillpdfs.get_form_fields("ex.pdf")

Credit to dvska's answer for the basis of the library code.

answered Mar 25, 2021 at 21:58

Tyler Houssian

Michael Gaskill · Accepted Answer · 2016-07-13 23:38:24Z

0

There is a typo on these lines:

file.write(pp.pformat(form))

Should be:

outfile.write(pp.pformat(form))

edited Jul 13, 2016 at 23:38

Michael Gaskill

answered Jul 13, 2016 at 22:54

Shane
