How to read a json.gz file using Python? [duplicate]

Question

EDIT: I have seen all of the questions on SO for this, and they all give me the error I'm asking about here. Please can you leave it open so I can get some help?

I have a file I can read very simply with Bash like this: gzip -d -c my_file.json.gz | jq . This confirms that it is valid JSON. But when I try to read it using Python like so:

import json
import gzip
with gzip.open('my_file.json.gz') as f:
    data = f.read() # returns a byte string `b'`
json.loads(data)

I get the error:

json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 1632)

But I know it is valid JSON from my Bash command. I have been stuck on this seemingly simple problem for a long time now and have tried everything it feels like. Can anyone help? Thank you.

If your problem is reproducible even after you fix the binary error, please edit this to (probably fix that red herring and) provide a minimal reproducible example with data which exhibits the problem. With the diagnostics you have provided, we can only conclude that Python's JSON parser is stricter than the one in jq. In particular, jq tolerates input with multiple JSON structures each on a separate line, but that's not valid JSON.
– tripleee, Commented Mar 18, 2022 at 11:47

tripleee · Accepted Answer · 2022-07-07 04:33:37Z

6

Like the documentation tells you, gzip.open() returns a binary file handle by default. Pass in an rt mode to read the data as text:

with gzip.open("my_file.json.gz", mode="rt") as f:
    data = f.read()

... or separately .decode() the binary data (you then obviously have to know or guess its encoding).
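
As a minimal sketch of both options, assuming a file that holds a single UTF-8 encoded JSON document: the file name `single.json.gz` and its contents are made up for illustration and are written first only so the example is self-contained.

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "single.json.gz")
with gzip.open(path, mode="wt", encoding="utf-8") as f:
    json.dump({"name": "example", "values": [1, 2, 3]}, f)

# Option 1: text mode; json.load() reads directly from the file handle.
with gzip.open(path, mode="rt", encoding="utf-8") as f:
    from_text = json.load(f)

# Option 2: binary mode, then decode explicitly (assumes the file is UTF-8).
with gzip.open(path, mode="rb") as f:
    from_bytes = json.loads(f.read().decode("utf-8"))
```

Both variables should end up holding the same dictionary. Passing `encoding` explicitly in text mode avoids depending on the platform's default encoding.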

If your input file contains multiple JSON records on separate lines (called "JSON Lines" or "JSONL"), where each line is separately a valid JSON structure, jq can handle that without any extra options, but Python's json module needs you to specify your requirement in more detail, perhaps like this:

with gzip.open("my_file.json.gz", mode="rt") as f:
    data = [json.loads(line) for line in f]
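
To see why the whole-file parse fails while the per-line parse works, here is a small sketch under the assumption that the file really is in JSON Lines form. The file name `records.json.gz` and the records are hypothetical, and skipping blank lines is an addition of this sketch rather than part of the snippet above.

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample file with one JSON object per line (JSON Lines).
path = os.path.join(tempfile.mkdtemp(), "records.json.gz")
with gzip.open(path, mode="wt", encoding="utf-8") as f:
    f.write('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')

# Parsing the whole file as one document fails after the first object.
with gzip.open(path, mode="rt", encoding="utf-8") as f:
    try:
        json.loads(f.read())
        whole_file_error = None
    except json.JSONDecodeError as exc:
        whole_file_error = exc.msg  # short message without the position

# Parsing line by line works; blank lines are skipped.
with gzip.open(path, mode="rt", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
```

Here `whole_file_error` should be the same "Extra data" message reported in the question, which suggests that the file in the question contains more than one top-level JSON value.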

edited Jul 7, 2022 at 4:33

answered Mar 18, 2022 at 11:24

tripleee


1 Comment

tripleee Over a year ago

By definition, JSON is UTF-8, but of course there are random amateur tools which produce pseudo-JSON with some random legacy 8-bit encoding, most probably Windows-1252 but YMMV.
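
If a file does turn out to use a legacy encoding, the `encoding` parameter of `gzip.open()` can name it in text mode. A minimal sketch, assuming Windows-1252; the file name `legacy.json.gz` and its contents are invented for illustration.

```python
import gzip
import json
import os
import tempfile

# Hypothetical file written with Windows-1252, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "legacy.json.gz")
with gzip.open(path, mode="wb") as f:
    f.write('{"city": "Z\u00fcrich"}'.encode("cp1252"))

# Name the encoding explicitly instead of relying on the platform default.
with gzip.open(path, mode="rt", encoding="cp1252") as f:
    data = json.load(f)
```

Decoding the same bytes as UTF-8 would raise a `UnicodeDecodeError`, because the single byte that Windows-1252 uses for "ü" is not valid UTF-8 in that position.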

madmatrix · Accepted Answer · 2022-03-18 11:25:53Z

0

It's the read mode and the decoding that need to be specified.

Sample code

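import gzip

# Open in binary mode; the with block closes the file afterwards.
with gzip.open('a.json.gz', 'rb') as f:
    file_content = f.read()
# decode() assumes UTF-8 unless another encoding is passed.
print(file_content.decode())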

answered Mar 18, 2022 at 11:25

madmatrix

