
Hello :) I have a big bin file which has been gzipped (so it's a blabla.bin.gz).

I need to decompress and write it to a txt file with ascii format. Here's my code :

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:   

    file_content = f.read()
    file_content.decode("ascii")
    output = open("new_file.txt", "w", encoding="ascii")
    output.write(file_content)
    output.close()

But I got this error :

file_content.decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 19: ordinal not in range(128)

I'm not so new to Python but format/coding problems have always been my greatest weakness :(

Please, could you help me?

Thank you !!!

  • Have you considered that the gzipped file might have been UTF-8 or some other Unicode encoding, i.e. something that 7-bit ASCII (128 code points) cannot represent? Could you check this? Just for giggles, try encoding='utf-8', or just file_content.decode("utf-8"). Better get used to UTF-8; it's kind of a default nowadays. Commented Dec 18, 2017 at 16:07
  • You should use this instead: docs.python.org/3/library/binascii.html Commented Dec 18, 2017 at 16:10
  • Does file_content.decode('cp1252') work? 0x94 is a closing curly double quote in cp1252, which is a common encoding on Windows systems. Commented Dec 18, 2017 at 16:14
  • @PatrickArtner (1) ValueError: Argument 'encoding' not supported in binary mode (I'm in binary mode using 'rb') ; (2) I MUST create an ascii file. :( Commented Dec 18, 2017 at 16:14
  • @usr2564301: beware, cp1252 is close to Latin1 but not identical to it; only Latin1 guarantees that a decode/encode round trip is a no-op. Commented Dec 18, 2017 at 16:34

1 Answer

First, there is no reason to decode anything only to immediately write it back as raw bytes. So a simpler (and more robust) implementation could be:

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()

with open("new_file.txt", "wb") as output:  # just directly write raw bytes
    output.write(file_content)
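
Since the question mentions a big file, f.read() may need a lot of memory because it loads the whole decompressed content at once. A minimal sketch of a streaming variant, assuming the standard library's shutil.copyfileobj is acceptable here; the function name gunzip_to_file and the 1 MiB chunk size are illustrative choices and not part of the original answer:

```python
import gzip
import shutil


def gunzip_to_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress src_path into dst_path, copying raw bytes chunk by chunk.

    Hypothetical helper: the name and the default chunk size are assumptions.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads at most chunk_size bytes at a time,
        # so the whole file is never held in memory.
        shutil.copyfileobj(src, dst, chunk_size)
```

Calling gunzip_to_file("GoogleNews-vectors-negative300.bin.gz", "new_file.txt") would then write the same bytes as the snippet above, just without the intermediate file_content buffer.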

If you really want to decode but are unsure of the encoding, you could use Latin1. Every byte is valid in Latin1 and is translated to the Unicode character with the same value. So whatever the byte string bs is, bs.decode('Latin1').encode('Latin1') is just a copy of bs.
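
The round-trip property can be checked directly. The following is only a small sketch; the variable names are made up for the illustration, and the last lines contrast Latin1 with cp1252 using the 0x94 byte from the question's error message:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin1 accepts every byte and maps it to the code point of the same value.
text = all_bytes.decode("latin-1")
same_values = all(ord(ch) == b for ch, b in zip(text, all_bytes))

# Encoding back gives exactly the original bytes.
round_trip_ok = text.encode("latin-1") == all_bytes

# cp1252 differs: it maps 0x94 to U+201D (right double quotation mark),
# whereas Latin1 maps it to U+0094.
cp1252_char = b"\x94".decode("cp1252")
latin1_char = b"\x94".decode("latin-1")
```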

Finally, if you really need to filter out all non-ASCII characters, you could use the errors parameter of decode:

file_content = file_content.decode("ascii", errors="ignore") # just remove any non ascii byte

or:

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()

# non-ASCII bytes are replaced with the U+FFFD replacement character
file_content = file_content.decode("ascii", errors="replace")

# U+FFFD cannot be encoded as ASCII, so it is written as a question mark "?"
with open("new_file.txt", "w", encoding="ascii", errors="replace") as output:
    output.write(file_content)
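
To make the difference between the two error handlers concrete, here is a tiny sketch on a made-up byte string (the sample value is an assumption for the illustration; it just contains the 0x94 byte from the error message). Note that both handlers lose information, so the output is no longer a faithful copy of the input:

```python
# Hypothetical sample containing one non-ASCII byte (0x94).
raw = b"caf\x94 ok"

# errors="ignore" silently drops the offending byte.
ignored = raw.decode("ascii", errors="ignore")      # 'caf ok'

# errors="replace" substitutes U+FFFD when decoding...
replaced = raw.decode("ascii", errors="replace")    # 'caf\ufffd ok'

# ...and "?" when that character is encoded back to ASCII,
# which is what the text-mode file above does on write.
written = replaced.encode("ascii", errors="replace")  # b'caf? ok'
```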

2 Comments

Thank you, but it gives me this error: output.write(file_content) TypeError: write() argument must be str, not bytes. So basically it still treats file_content as binary data... but why?
@inTaowetrust: In the first solution, file_content is a byte string and the output file is opened in binary mode ("wb"), while in the second, file_content becomes a Unicode string and the file is opened in text mode. Wait... I had forgotten to assign the result back to file_content :-( . Please see my edit.
