Convert pdf data to JSON format using Python? [closed]

Question

I am trying to print data in JSON format, but it is being printed as plain text.

import PyPDF2
import json

pdf_file = open('data.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()

data = json.dumps(page_content)
print(data)

What are you getting and what did you expect to get instead?
– user5386938, Commented Jan 3, 2021 at 5:42
If you aren't getting an error, then data should be the JSON equivalent of page_content. So what is the value of page_content?
– CryptoFool, Commented Jan 3, 2021 at 6:22
@Niloct - why would you expect that to work? That would only work if extracting text from the PDF file data.pdf gave you valid JSON data. Even if the PDF was a printout of JSON data, the chance of getting that data as clean text when pulling it from a PDF is very small. And I don't see that the OP has said anything about what's in the PDF they are reading from.
– CryptoFool, Commented Jan 3, 2021 at 6:27
Without seeing 1) what's in the PDF, 2) your current result, and 3) your desired result (all necessary for a minimal reproducible example), it's almost impossible to know what the problem is or how to fix it.
– CrazyChucky, Commented Jun 5, 2024 at 0:46

CryptoFool · Accepted Answer · 2021-01-03 06:51:08Z

5

My guess is that you're expecting to see more structure in the JSON you are getting, such as a pair of curly braces or square brackets. But curly braces represent a JSON object (a Python dictionary of key/value pairs), and square brackets represent an array (a Python list). What you are encoding as JSON is neither of those things.

page.extractText returns text from the PDF being read as a single Python string value. The JSON encoding of a Python string value is the text of that string within a pair of double quotes. So the JSON you're getting will be of the form:

"<text from pdf document>"

It doesn't matter what's in the PDF. Whatever text you get back from page.extractText will always be a single Python string. What you get when you encode that string as JSON will always be that same text, with double quotes before and after it.

Here's a little code to illustrate this:

import json
s1 = "This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes"
print(s1)
print(json.dumps(s1))

Result:

This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes
"This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes"
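
For contrast, here is a minimal sketch of how the same kind of text could be wrapped in a dictionary or a list before encoding, which is what produces curly braces or square brackets in the output. The page_text value and the "page" and "text" key names are made up for illustration; they are not taken from the question.

```python
import json

# Stand-in for the string returned by page.extractText(); the content is made up.
page_text = "Invoice 42\nTotal: 10.00"

# A bare string encodes as a JSON string (double quotes only).
as_string = json.dumps(page_text)

# A dict encodes as a JSON object (curly braces). The key names are hypothetical.
as_object = json.dumps({"page": 1, "text": page_text})

# A list encodes as a JSON array (square brackets), here one item per line of text.
as_array = json.dumps(page_text.splitlines())

print(as_string)
print(as_object)
print(as_array)
```

Which shape is appropriate depends on what the consumer of the JSON expects, so this is only one possible arrangement.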

2 Comments

tarun kumar Over a year ago

Yes, you are right, I am getting a double-quoted string, but the file is in JSON format (data.json). Is there any other library for formatting it as actual JSON?

CryptoFool Over a year ago

I'm not sure what you're saying. Again, a double-quoted string IS JSON format. Any JSON library should give you the same thing, in whatever language you are using. If you expect something other than what you're getting, please provide what you expect to be getting in your question.

David Allen · Accepted Answer · 2024-02-29 21:20:13Z

3

Simply converting a string with json.dumps() will not yield your desired result, since the string first needs to be split into key-value pairs.
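
As a rough sketch of what that splitting could look like, assume the extracted text happens to contain one "key: value" pair per line. That layout is an assumption, since the question does not say what the PDF contains, and text_to_pairs and sample_text are hypothetical names used only for this example.

```python
import json


def text_to_pairs(text, sep=":"):
    """Turn 'key: value' lines into a dict; lines without the separator are skipped."""
    pairs = {}
    for line in text.splitlines():
        key, found, value = line.partition(sep)
        if found and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


# Made-up text standing in for what a PDF page might yield.
sample_text = "Name: Alice\nCity: Paris\nthis line has no separator"

result = json.dumps(text_to_pairs(sample_text), indent=2)
print(result)
```

Real PDF text is rarely this tidy, so the parsing step usually has to be tailored to the document at hand.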

If you need to extract a lot of data from an unstructured PDF, you may want to consider Adobe's PDF Extract API, which has a Python SDK. The API converts the structural and text information from a PDF directly into JSON, so you don't have to do it manually.

The JSON data will contain an array of elements with information such as the following:

{
    "Page": 1,
    "Path": "//Document/P",
    "Text": "The quick brown fox jumps over the lazy dog "
}
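
Once such a file is available, the elements can be post-processed with the standard json module. The sketch below groups the text by page. It assumes the array sits under a top-level "elements" key and that some elements may have no "Text" field; both are assumptions to check against the actual output. The sample data and the text_by_page helper are invented for illustration.

```python
import json
from collections import defaultdict

# Hypothetical extraction output; the top-level "elements" key is an assumption.
raw = """
{
  "elements": [
    {"Page": 1, "Path": "//Document/P", "Text": "The quick brown fox jumps over the lazy dog "},
    {"Page": 1, "Path": "//Document/P[2]", "Text": "Second paragraph "},
    {"Page": 2, "Path": "//Document/P[3]", "Text": "Text on the next page "}
  ]
}
"""


def text_by_page(document):
    """Collect the stripped text of each element under its page number."""
    pages = defaultdict(list)
    for element in document.get("elements", []):
        # Skip elements without text (assumed possible, e.g. for figures).
        if "Text" in element:
            pages[element["Page"]].append(element["Text"].strip())
    return dict(pages)


pages = text_by_page(json.loads(raw))
print(json.dumps(pages, indent=2))
```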
