I am trying to print data in JSON format, but it is being printed as plain text.

import json

import PyPDF2

# Legacy PyPDF2 API (releases before 3.0)
with open('data.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    page = read_pdf.getPage(0)          # first page of the document
    page_content = page.extractText()   # returns a plain Python str

# Encode the extracted string as JSON
data = json.dumps(page_content)
print(data)
  • What are you getting and what did you expect to get instead? Commented Jan 3, 2021 at 5:42
  • Try json.loads(page_content) instead. Commented Jan 3, 2021 at 6:21
  • If you aren't getting an error, then data should be the JSON equivalent of page_content. So what is the value of page_content? Commented Jan 3, 2021 at 6:22
  • @Niloct - why would you expect that to work? That would only work if extracting text from the PDF file data.pdf gave you valid JSON data. Even if the PDF was a print out of JSON data, the chance of getting that data as clean text when pulling it from a PDF is very small. And I don't see that the OP has said anything about what's in the PDF they are reading from. Commented Jan 3, 2021 at 6:27
  • Without seeing 1) what's in the PDF, 2) your current result, and 3) your desired result (all necessary for a minimal reproducible example), it's almost impossible to know what the problem is or how to fix it. Commented Jun 5, 2024 at 0:46

2 Answers

My guess is that you're expecting to see more structure in the JSON you are getting, such as a pair of curly braces or square brackets. But curly braces represent an object (a dictionary of key/value pairs in Python), and square brackets represent an array (a list). What you are encoding as JSON is neither of those things.

page.extractText returns text from the PDF being read as a single Python string value. The JSON encoding of a Python string value is the text of that string within a pair of double quotes. So the JSON you're getting will be of the form:

"<text from pdf document>"

It doesn't matter what's in the PDF. Whatever text you get back from page.extractText will always be a single Python string. What you get when you encode that string as JSON will always be that same text with double quotes before and after it, and with special characters such as embedded quotes and newlines escaped.

Here's a little code to illustrate this:

import json
s1 = "This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes"
print(s1)
print(json.dumps(s1))

Result:

This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes
"This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes"

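To make the distinction concrete, here is a minimal sketch comparing the JSON encoding of a bare string with the encoding of the same string wrapped in a dictionary. The sample text and the "page" and "text" key names are made up for illustration; they are not taken from the question's PDF.

```python
import json

# Hypothetical stand-in for the text returned by page.extractText()
page_content = 'Invoice "A-17"\nTotal: 42'

# A bare string encodes to a JSON string: quoted, with special characters escaped
as_string = json.dumps(page_content)
print(as_string)

# Wrapping the string in a dict produces a JSON object with curly braces
as_object = json.dumps({"page": 1, "text": page_content})
print(as_object)

# Both forms round-trip back to the original Python values
assert json.loads(as_string) == page_content
assert json.loads(as_object)["text"] == page_content
```

Both outputs are valid JSON. The only difference is the Python value that was encoded, so any structure has to be built in Python before calling json.dumps.
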
2 Comments

Yes, you are right. I am getting a double-quoted string, but the file is in JSON format (data.json). Is there another library for formatting it as actual JSON?
I'm not sure what you're saying. Again, a double-quoted string IS JSON format. Any JSON library should give you the same thing, in whatever language you are using. If you expect something other than what you're getting, please provide what you expect to be getting in your question.

Simply converting a string with json.dumps() will not yield your desired result, since the string first needs to be split into key-value pairs.
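
As a rough sketch of that splitting step, the snippet below assumes the extracted text consists of "key: value" lines. That layout is an assumption made for illustration; real PDF text is often messier and may need different parsing. The helper name text_to_pairs and the sample text are hypothetical.

```python
import json


def text_to_pairs(text):
    """Turn 'key: value' lines into a dict; lines without a colon are skipped."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


# Hypothetical stand-in for text extracted from a PDF page
sample_text = "Name: Alice\nCity: Paris\nthis line has no separator"

data = text_to_pairs(sample_text)
print(json.dumps(data, indent=2))
```

Once the text is in a dictionary, json.dumps emits a JSON object with curly braces instead of a single quoted string.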

If you need to extract a lot of data from an unstructured PDF, you may want to consider Adobe's PDF Extract API, which has a Python SDK. The API converts the structural and text information from a PDF directly into JSON, so you don't have to do it manually.

The JSON data will contain an array of elements with information such as the following:

{
"Page": 1,
"Path": "//Document/P",
"Text": "The quick brown fox jumps over the lazy dog "
}
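
Once you have elements in that shape, reading them back in Python needs only the standard json module. The sketch below assumes a plain JSON array of elements with the "Page", "Path", and "Text" keys shown above. The sample data is invented, and the real output may nest the array inside a larger document, so adjust the lookup accordingly.

```python
import json
from collections import defaultdict

# Hypothetical sample: a JSON array of elements shaped like the one above
raw = '''[
  {"Page": 1, "Path": "//Document/P", "Text": "The quick brown fox "},
  {"Page": 1, "Path": "//Document/P[2]", "Text": "jumps over the lazy dog "},
  {"Page": 2, "Path": "//Document/P[3]", "Text": "Second page text "}
]'''

elements = json.loads(raw)

# Group the text fragments by page number
text_by_page = defaultdict(list)
for element in elements:
    text = element.get("Text")  # tolerate elements that carry no text
    if text is not None:
        text_by_page[element["Page"]].append(text.strip())

for page, parts in sorted(text_by_page.items()):
    print(page, " ".join(parts))
```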
