2

EDIT: I have seen all of the questions on SO about this, and they all give me the error I'm asking about here. Please could you leave it open so I can get some help?

I have a file I can read very simply with Bash like this: gzip -d -c my_file.json.gz | jq . This confirms that it is valid JSON. But when I try to read it using Python like so:

import json
import gzip
with gzip.open('my_file.json.gz') as f:
    data = f.read() # returns a byte string `b'`
json.loads(data)

I get the error:

json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 1632)

But I know it is valid JSON from my Bash command. I have been stuck on this seemingly simple problem for a long time now and feel like I have tried everything. Can anyone help? Thank you.

2
  • If your problem is reproducible even after you fix the binary error, please edit this to (probably fix that red herring and) provide a minimal reproducible example with data which exhibits the problem. With the diagnostics you have provided, we can only conclude that Python's JSON parser is stricter than the one in jq. In particular, jq tolerates input with multiple JSON structures each on a separate line, but that's not valid JSON. Commented Mar 18, 2022 at 11:47
  • I updated with another duplicate to explain that part. Commented Mar 18, 2022 at 11:55

2 Answers

6

Like the documentation tells you, gzip.open() returns a binary file handle by default. Pass in an rt mode to read the data as text:

with gzip.open("my_file.json.gz", mode="rt") as f:
    data = f.read()

... or separately .decode() the binary data (you then obviously have to know or guess its encoding).
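
Both options can be sketched side by side. This is only an illustration, assuming the data is UTF-8; the sample file, its contents, and the variable names are made up so that the snippet runs on its own:

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, mode="wt", encoding="utf-8") as f:
    json.dump({"name": "café"}, f, ensure_ascii=False)

# Option 1: text mode with an explicit encoding (assumed to be UTF-8).
with gzip.open(path, mode="rt", encoding="utf-8") as f:
    from_text = json.loads(f.read())

# Option 2: binary mode, then decode the bytes yourself.
with gzip.open(path, mode="rb") as f:
    from_bytes = json.loads(f.read().decode("utf-8"))
```

Stating the encoding explicitly avoids depending on the platform's default text encoding.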

If your input file contains multiple JSON records on separate lines (called "JSON Lines" or "JSONL"), where each is separately a valid JSON structure, jq can handle that without any extra options, but Python's json module parses exactly one document per call, so you have to parse each line yourself, perhaps like this:

with gzip.open("my_file.json.gz", mode="rt") as f:
    data = [json.loads(line) for line in f]
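
To check whether this is what is happening, here is a small self-contained sketch that reproduces the "Extra data" error and then parses the same file line by line. The two-record file is invented for the demonstration, and skipping blank lines is an added assumption about what real data may contain:

```python
import gzip
import json
import os
import tempfile

# Hypothetical JSON Lines file with two records, created for the demo.
path = os.path.join(tempfile.mkdtemp(), "records.json.gz")
with gzip.open(path, mode="wt", encoding="utf-8") as f:
    f.write('{"id": 1}\n{"id": 2}\n')

# Parsing the whole file as a single document fails on the second record.
with gzip.open(path, mode="rt", encoding="utf-8") as f:
    text = f.read()
try:
    json.loads(text)
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg  # the short message, without line/column details

# Parsing one line at a time works; blank lines are skipped.
with gzip.open(path, mode="rt", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
```

If `error` is set for your own file but the per-line loop succeeds, the file is JSON Lines rather than a single JSON document.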

1 Comment

By definition, JSON is UTF-8, but of course there are random amateur tools which produce pseudo-JSON with some random legacy 8-bit encoding, most probably Windows-1252 but YMMV.
0

The read mode and the decoding step are what need to be specified.

Sample code

import gzip

# Open in binary mode, then decode the bytes to text (UTF-8 by default).
with gzip.open('a.json.gz', 'rb') as f:
    file_content = f.read()
print(file_content.decode())
