Python - Decoding Error ('ascii' codec can't decode byte 0x94 in position 19.....)

Question

Hello :) I have a big bin file which has been gzipped (so it's a blabla.bin.gz).

I need to decompress and write it to a txt file with ascii format. Here's my code :

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:   

    file_content = f.read()
    file_content.decode("ascii")
    output = open("new_file.txt", "w", encoding="ascii")
    output.write(file_content)
    output.close()

But I got this error :

file_content.decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 19: ordinal not in range(128)

I'm not so new to Python but format/coding problems have always been my greatest weakness :(

Please, could you help me?

Thank you !!!

Have you considered that the gzipped file might be UTF-8 or some other encoding? Could you check this? It may contain something that 7-bit ASCII cannot represent. Just for giggles: try encoding='utf-8', or just file_content.decode("utf-8"). Better get used to UTF-8; it's kind of a default nowadays.
– Patrick Artner, Commented Dec 18, 2017 at 16:07
You should use this instead: docs.python.org/3/library/binascii.html
– F. Leone, Commented Dec 18, 2017 at 16:10
Does file_content.decode('cp1252') work? 0x94 is a closing curly double quote in cp1252, which is a common encoding on Windows systems.
– snakecharmerb, Commented Dec 18, 2017 at 16:14
@PatrickArtner (1) ValueError: Argument 'encoding' not supported in binary mode (I'm in binary mode using 'rb'); (2) I MUST create an ascii file. :(
– in Tao we trust, Commented Dec 18, 2017 at 16:14
@usr2564301: beware, cp1252 is close to Latin1 but is not the same; only Latin1 guarantees that decode/encode is a no-op.
– Serge Ballesta, Commented Dec 18, 2017 at 16:34

Serge Ballesta · Accepted Answer · 2017-12-18 17:22:10Z

2

First, there is no reason to decode anything only to immediately write it back as raw bytes. So a simpler (and more robust) implementation could be:

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()  # decompressed content, as raw bytes
    with open("new_file.txt", "wb") as output:  # just directly write raw bytes
        output.write(file_content)
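Since the question mentions a big file, here is a minimal sketch of the same raw-bytes copy done in chunks, so the whole decompressed content never has to sit in memory at once. The helper name gunzip_to_file and the 1 MiB chunk size are assumptions made for illustration; they are not part of the original answer.

```python
import gzip
import shutil


def gunzip_to_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress src_path into dst_path, copying raw bytes chunk by chunk.

    Hypothetical helper: the name and default chunk size are illustrative.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads from src and writes to dst in chunk_size pieces,
        # so no decoding happens and memory use stays bounded.
        shutil.copyfileobj(src, dst, chunk_size)
```

With the file from the question, this would be called as gunzip_to_file("GoogleNews-vectors-negative300.bin.gz", "new_file.txt").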

If you really want to decode but are unsure of the encoding, you could use Latin1. Every byte is valid in Latin1 and is translated to the Unicode character with the same value. So whatever the byte string bs is, bs.decode('Latin1').encode('Latin1') is just a copy of bs.
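The round-trip property can be checked on every possible byte value with a short sketch like the one below. The variable names are illustrative only. The last part contrasts this with cp1252, which in Python's codec leaves a few byte values (0x81 is used here) undefined.

```python
# Every possible byte value, 0x00 through 0xFF
all_bytes = bytes(range(256))

# Latin-1 never raises: each byte maps to the code point of the same value
text = all_bytes.decode("latin-1")
# Encoding maps each code point back to the same byte
round_tripped = text.encode("latin-1")

assert round_tripped == all_bytes
assert [ord(ch) for ch in text] == list(range(256))

# cp1252 is not a full round-trip encoding: byte 0x81 has no mapping
try:
    b"\x81".decode("cp1252")
    cp1252_decodes_0x81 = True
except UnicodeDecodeError:
    cp1252_decodes_0x81 = False
```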

Finally, if you really need to filter out all non-ASCII characters, you could use the errors parameter of decode:

file_content = file_content.decode("ascii", errors="ignore") # just remove any non ascii byte

or:

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()

# non-ASCII bytes are replaced with the U+FFFD replacement character
file_content = file_content.decode("ascii", errors="replace")

# U+FFFD is itself not ASCII, so on writing it is replaced with a question mark "?"
with open("new_file.txt", "w", encoding="ascii", errors="replace") as output:
    output.write(file_content)
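To make the difference between the two error handlers concrete, here is a small sketch on a made-up byte string containing 0x94, the byte from the error message. The sample data and variable names are assumptions for illustration.

```python
raw = b"caf\x94 ok"  # 0x94 is outside the ASCII range (0x00 to 0x7F)

# errors="ignore" silently drops the offending byte
ignored = raw.decode("ascii", errors="ignore")

# errors="replace" substitutes U+FFFD for the offending byte
replaced = raw.decode("ascii", errors="replace")

# U+FFFD cannot be encoded as ASCII, so encoding with errors="replace"
# turns it into a question mark
ascii_bytes = replaced.encode("ascii", errors="replace")

assert ignored == "caf ok"
assert replaced == "caf\ufffd ok"
assert ascii_bytes == b"caf? ok"
```

Either handler changes the content, so for binary data the output can no longer be read back as the original file.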

edited Dec 18, 2017 at 17:22

answered Dec 18, 2017 at 16:23

Serge Ballesta


2 Comments

in Tao we trust Over a year ago

Thank you, but it gives me this error: output.write(file_content) TypeError: write() argument must be str, not bytes. So basically it still treats file_content as binary... but why?

Serge Ballesta Over a year ago

@inTaowetrust: In the first solution, file_content is a byte string and the output file is opened in binary mode ("wb"), while in the second, file_content becomes a unicode string and the file is opened in text mode. Wait... I had forgotten to assign the decoded result back to file_content :-( . Please see my edit.
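The exchange above comes down to two rules: a file opened in text mode accepts only str, and decode() returns a new str instead of changing the bytes object in place. A minimal sketch follows, using a temporary file and a made-up four-byte payload, both of which are assumptions for illustration.

```python
import os
import tempfile

data = b"abc\x94"
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode expects str: writing bytes raises TypeError
try:
    with open(path, "w", encoding="ascii") as out:
        out.write(data)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = str(exc)

# Binary mode accepts the bytes unchanged
with open(path, "wb") as out:
    out.write(data)

# decode() returns a new str and leaves `data` untouched,
# so the result has to be assigned to a name before it is written
text = data.decode("ascii", errors="replace")
with open(path, "w", encoding="ascii", errors="replace") as out:
    out.write(text)
```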
