My guess is that you're expecting to see more structure in the JSON you're getting, such as a pair of curly braces or square brackets. But curly braces represent an object (key/value pairs, a dictionary in Python), and square brackets represent an array (a list in Python). What you are encoding as JSON is neither of those things.
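To make the distinction concrete, here is a minimal sketch of how json.dumps renders a dict, a list, and a plain string. The sample values are made up for illustration and have nothing to do with the PDF in the question.

```python
import json

# Sample values are hypothetical; only the outer delimiters matter here.
as_object = json.dumps({"title": "Report", "pages": 3})  # dict -> curly braces
as_array = json.dumps(["page one", "page two"])          # list -> square brackets
as_string = json.dumps("page one")                       # str  -> double quotes only

print(as_object)  # {"title": "Report", "pages": 3}
print(as_array)   # ["page one", "page two"]
print(as_string)  # "page one"
```

Only the first two produce the braces or brackets you may have been expecting; a bare string never does.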
page.extractText returns the text of the PDF page being read as a single Python string value. The JSON encoding of a Python string is the text of that string within a pair of double quotes, with any special characters (such as newlines or embedded double quotes) escaped. So the JSON you're getting will be of the form:
"<text from pdf document>"
It doesn't matter what's in the PDF. Whatever text you get back from page.extractText will always be a single Python string. What you get when you encode that string as JSON will always be that same text (with escapes wherever JSON requires them), with double quotes before and after it.
Here's a little code to illustrate this:
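import json

# A plain Python string, standing in for what page.extractText returns
s1 = "This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes"
print(s1)              # the raw text
print(json.dumps(s1))  # the same text as a JSON string, wrapped in double quotes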
Result:
This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes
"This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes"
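If you do want braces in the output, one option is to put the string inside a dict before encoding it. The sketch below is only an illustration: page_text is a made-up stand-in for whatever page.extractText returns, and the "page" and "text" keys are hypothetical names, not anything the PDF library provides. It also shows that newlines and embedded quotes get escaped, and that decoding gives the original string back.

```python
import json

# Made-up stand-in for the string returned by page.extractText
page_text = 'Line one\nShe said "hello"'

# Encoding the bare string: quotes around it, special characters escaped
encoded = json.dumps(page_text)
print(encoded)

# Decoding the JSON string gives back exactly the original Python string
decoded = json.loads(encoded)
print(decoded == page_text)

# Wrapping the string in a dict (hypothetical keys) yields a JSON object
wrapped = json.dumps({"page": 0, "text": page_text})
print(wrapped)
```

The choice of a dict here is just one way to get structured output; a list of page strings would give you square brackets instead.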
json.loads(page_content) instead.
data should be the JSON equivalent of page_content. So what is the value of page_content?
data.pdf gave you valid JSON data. Even if the PDF was a printout of JSON data, the chance of getting that data as clean text when pulling it from a PDF is very small. And I don't see that the OP has said anything about what's in the PDF they are reading from.
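On the json.loads point, a small sketch of what happens when the extracted text is not valid JSON may help. The helper name parse_if_json is hypothetical, and the two sample inputs are invented; the idea is just that json.loads raises json.JSONDecodeError on ordinary prose, so it only makes sense if the page text really is clean JSON.

```python
import json

def parse_if_json(page_content):
    """Return the decoded value if page_content is valid JSON, else None.

    Hypothetical helper for illustration only.
    """
    try:
        return json.loads(page_content)
    except json.JSONDecodeError:
        # Ordinary text pulled from a PDF will usually end up here
        return None

print(parse_if_json('{"name": "example"}'))                # {'name': 'example'}
print(parse_if_json("Plain text pulled from a PDF page"))  # None
```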