I'm trying to open and read a .gz file, but I keep getting this error:

zlib.error: Error -3 while decompressing data: invalid distance too far back

The file that I'm trying to read was created line by line, using this code:

with gzip.open(output_path + out_fn, 'a') as fout:
    json_str = json.dumps(json.loads(data)) + "\n"
    json_bytes = json_str.encode('utf-8')
    fout.write(json_bytes)

This ran each time new streaming data arrived (in the 'data' variable).

This is the script that I used to read the .gz file:

with gzip.open(file_path) as f:
    new_lines = []
    for line in f:
        new_l = line.decode('utf-8')
        new_l = json.loads(new_l)
        new_lines.append(new_l)

The error is raised from the 'for' line, after successfully reading some lines, so there might be something wrong with specific lines.

I'll be ok with only skipping those problematic lines if possible, or of course fixing the entire file.


Edit:

I've uploaded the .gz file here - https://www.dropbox.com/s/wp34maf8n8wb5ur/tweets.gz?dl=0

I couldn't find a smaller example, sorry.

  • I could not reproduce your error (on a Windows machine). If you start with a new compressed file, does the error still happen? Do you have some minimal test data that causes the error? Commented Jun 29, 2022 at 7:19
  • @nonDucor - the example the OP gave opens the file in "a" mode. If the OP ran the script a few times, multiple gzip archives would be appended to a single file, which might be what causes the issue. Commented Jun 29, 2022 at 7:26
  • The gzip format is supposed to handle concatenated files. I, too, failed to reproduce this, and agree that a minimal and complete example would be interesting. Commented Jun 29, 2022 at 7:46
  • Thanks for trying! I've uploaded the problematic file, sorry for its size. Commented Jun 29, 2022 at 13:06
  • It is possible that you corrupted the file by running your update lines from multiple threads or multiple processes. Commented Jun 29, 2022 at 22:43
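The point about concatenated members can be checked with a small sketch. It assumes nothing about the original streaming code beyond what the question shows; `append_record` and `read_records` are hypothetical helper names introduced only for this illustration. Each call in append mode writes a new, complete gzip member, and Python's gzip module reads all members back as one stream.

```python
import gzip
import json


def append_record(path, record):
    # Mode "ab" adds one new, complete gzip member to the end of the file.
    with gzip.open(path, "ab") as fout:
        fout.write((json.dumps(record) + "\n").encode("utf-8"))


def read_records(path):
    # Reading decompresses every concatenated member in sequence.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

If a file written this way from a single process reads back cleanly, append mode alone is unlikely to explain the error.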

1 Answer

You're getting "invalid distance too far back" because there is an invalid distance that's too far back. Your gzip file is corrupted. Something happened to it between when you created it and when you tried to read it (and uploaded it to Dropbox).

Your gzip file is a sequence of many small gzip members, each of which can be decompressed individually. Upon encountering the error, you can search for the start of the next gzip member and try decompressing from there. Each gzip member starts with the bytes 1f 8b 08 (the two-byte gzip magic number followed by the deflate method byte). If you run into another error, then try again from there with another search and decompress.

Having done that, it looks like on the order of 10% of the gzip members in that file are bad.
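The search-and-retry approach described above could be sketched roughly as follows. This is not the exact code used to inspect the file; it is a minimal illustration that assumes the whole file fits in memory, and `recover_gzip_members` and `_try_member` are hypothetical names. A candidate offset counts as a member only if it decompresses completely, including the trailer check that zlib performs, so the three header bytes appearing by chance inside compressed data are rejected rather than treated as a split point.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM (8 = deflate)


def _try_member(view, start, block=65536):
    """Decompress one gzip member at `start`; return (data, end_offset)."""
    # wbits=31 makes zlib expect a gzip header and verify the trailer.
    d = zlib.decompressobj(wbits=31)
    out = []
    pos = start
    while pos < len(view) and not d.eof:
        piece = view[pos:pos + block]
        out.append(d.decompress(piece))
        pos += len(piece)
    if not d.eof:
        raise zlib.error("truncated gzip member")
    # Bytes fed after the end of the member are left in unused_data.
    return b"".join(out), pos - len(d.unused_data)


def recover_gzip_members(raw):
    """Return (good, bad): decompressed members and failed candidate offsets."""
    view = memoryview(raw)
    good, bad = [], 0
    pos = raw.find(GZIP_MAGIC)
    while pos != -1:
        try:
            data, end = _try_member(view, pos)
        except zlib.error:
            # Corrupt member or a chance match: resume the search one byte on.
            bad += 1
            pos = raw.find(GZIP_MAGIC, pos + 1)
            continue
        good.append(data)
        # Continue from the end of the member that was just consumed.
        pos = raw.find(GZIP_MAGIC, end)
    return good, bad
```

With the file read as bytes (for example `raw = open(file_path, 'rb').read()`), `b"".join(good).splitlines()` gives the surviving lines, which can then be decoded and passed to `json.loads` as in the question. Note that `bad` counts failed candidate offsets, which can be higher than the number of corrupted members.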

2 Comments

Thanks! Once I split the file using the header you said, I was able to skip the bad gzips.
You should take some care. Those three bytes could, just by chance, appear in the middle of compressed data. In fact, for a file this large that's pretty much guaranteed to happen several times. If you're just splitting every time you see those three bytes, you may break gzip members that are otherwise perfectly fine.
