PDF to XML/JSON using a Python module

Question

I am able to read the text from a PDF using this code:

import pdfx
pdf = pdfx.PDFx("1951.pdf")
metadata = pdf.get_metadata()
reference_list = pdf.get_references()
reference_dict = pdf.get_references_as_dict()
pdf.download_pdfs("D:/")
pdf.get_text()

But I can't convert it to JSON:

pdfx -d D:/Output/ -j -o output.json pdf
SyntaxError: invalid syntax

Syntax: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf
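
The pdfx line above is a shell command, so typing it at the Python prompt is the likely cause of the SyntaxError. A minimal sketch of staying in Python instead, assuming the dictionaries returned by get_metadata() and get_references_as_dict() are what should end up in the JSON file. The helper names references_to_json and export_pdf are hypothetical, not part of pdfx:

```python
import json


def references_to_json(metadata, references, output_path):
    """Write a metadata dict and a references dict to one JSON file.

    Hypothetical helper, not part of pdfx. Returns the payload it wrote.
    """
    payload = {"metadata": metadata, "references": references}
    with open(output_path, "w", encoding="utf-8") as fh:
        # default=str is a fallback for any value json cannot encode natively
        # (an assumption: metadata may contain such values).
        json.dump(payload, fh, indent=2, default=str)
    return payload


def export_pdf(pdf_path, output_path):
    """Read a PDF with pdfx (same calls as in the question) and dump to JSON."""
    import pdfx  # third-party; imported lazily so the helper above stays stdlib-only

    pdf = pdfx.PDFx(pdf_path)
    return references_to_json(
        pdf.get_metadata(), pdf.get_references_as_dict(), output_path
    )
```

Calling export_pdf("1951.pdf", "output.json") would then write the file without going through the command line at all.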

shikha singh · Accepted Answer · 2018-05-31 04:29:36Z


I was able to convert the PDF to XML using the PDFMiner Python module.

Download and unzip the module from http://pypi.python.org/pypi/pdfminer/
Run in a shell: python pdf2txt.py -o samples/output.xml -t xml samples/1951.pdf
For plain text: python pdf2txt.py samples/1951.pdf
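
If the conversion needs to be started from a Python script rather than typed into a shell, the same commands can be launched with subprocess. This is only a sketch: build_pdf2txt_command and convert are hypothetical helper names, and the script argument assumes pdf2txt.py is in the current directory (pass its full path if it lives elsewhere, for example under the Scripts folder of the Python installation).

```python
import subprocess
import sys


def build_pdf2txt_command(pdf_path, output_path=None, output_type=None,
                          script="pdf2txt.py"):
    """Build the argument list for the pdf2txt.py commands shown above."""
    cmd = [sys.executable, script]
    if output_path:
        cmd += ["-o", output_path]   # -o: output file
    if output_type:
        cmd += ["-t", output_type]   # -t: output type, e.g. "xml"
    cmd.append(pdf_path)
    return cmd


def convert(pdf_path, output_path, output_type="xml", script="pdf2txt.py"):
    """Run pdf2txt.py; raises CalledProcessError if the script fails."""
    cmd = build_pdf2txt_command(pdf_path, output_path, output_type, script)
    subprocess.run(cmd, check=True)
```

For example, convert("samples/1951.pdf", "samples/output.xml") runs the XML command from the steps above.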

edited May 31, 2018 at 4:29

answered May 30, 2018 at 18:00


3 Comments

Arpit Over a year ago

You are using pdf2txt.py. Is this file present in the pdfminer module?

shikha singh Over a year ago

Yes, pdf2txt.py comes with the pdfminer module.

Mehmet Kurtipek Over a year ago

It comes with pdfminer, but it is located under the Scripts folder of the Python installation.
