Write bytes literal with undefined character to CSV file (Python 3)

Question

Using Python 3.4.2, I want to scrape part of a web page. According to its meta tags, the page is encoded as iso-8859-1. I want to write one part of it (along with other parts) to a CSV file.

However, this part contains an undefined character with the hex value 0x8b. To preserve the part as faithfully as possible, I want to write it as is into the CSV file, but Python doesn't let me.

Here's a minimal example:

import urllib.request
import urllib.parse
import csv

if __name__ == "__main__":
    with open("bytewrite.csv", "w", newline="") as csvfile:
        a = b'\x8b' # byte literal by urllib.request
        b = a.decode("iso-8859-1")

        w = csv.writer(csvfile)
        w.writerow([b])

And this is the output:

Traceback (most recent call last):
  File "D:\Eigene\Dateien\Code\Python\writebyte.py", line 12, in <module>
    w.writerow([b])
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 0: character maps to <undefined>

Eventually, I did it manually. It was just copy and paste with Notepad++, and according to a hex editor the value was inserted correctly. But how can I do it with Python 3? Why does Python even care what 0x8b stands for, instead of just writing it to the file?

It further confuses me that, according to iso8859_1.py (and also cp1252.py) in C:\Python34\lib\encodings\, the lookup tables do seem to define a mapping for 0x8b:

# iso8859_1.py
    '\x8b'     #  0x8B -> <control>
# cp1252.py
    '\u2039'   #  0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK

As opposed to Python 2, where the file for the csv writer has to be opened in binary mode, you can't use binary mode with the csv writer in Python 3. After removing the newline argument, which you can't specify in binary mode, the exact error is: TypeError: 'str' does not support the buffer interface
– user2009388, Commented Feb 14, 2015 at 4:07

Mark Tolonen · Accepted Answer · 2015-02-14 00:12:15Z

3

Quoted from csv docs:

Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.

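What is happening is that you've decoded the byte 0x8b from iso-8859-1 into the Unicode character U+008B ('\x8b'), but the output file was opened with the system default encoding. On your system getpreferredencoding() returns cp1252, and cp1252 has no byte assigned to U+008B, so encoding it fails.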
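
The mismatch can be reproduced without any file or csv writer. The following is a minimal sketch, assuming only the standard codecs; the variable names are illustrative:

```python
# U+008B, as produced by decoding the byte 0x8b with iso-8859-1
text = b'\x8b'.decode('iso-8859-1')

# iso-8859-1 maps code points 0-255 straight back to the same byte values
latin1_bytes = text.encode('iso-8859-1')

# cp1252 assigns the byte 0x8b to U+2039 and has no byte for U+008B,
# so encoding U+008B raises the same error as in the traceback
try:
    text.encode('cp1252')
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

print(latin1_bytes, cp1252_failed)
```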

Corrected minimal example:

import csv
with open('bytewrite.csv', 'w', encoding='iso-8859-1', newline='') as csvfile:
    a = b'\x8b'
    b = a.decode("iso-8859-1")
    w = csv.writer(csvfile)
    w.writerow([b])
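
To check that the byte really lands in the file unchanged, the file can be read back in binary mode. This is a small sketch; writing to a temporary directory is an assumption made here so the example does not touch the working directory:

```python
import csv
import os
import tempfile

# Hypothetical output location inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'bytewrite.csv')

with open(path, 'w', encoding='iso-8859-1', newline='') as csvfile:
    csv.writer(csvfile).writerow([b'\x8b'.decode('iso-8859-1')])

# Read the raw bytes back: the original 0x8b followed by the
# csv module's default line terminator '\r\n'
with open(path, 'rb') as f:
    raw = f.read()

print(raw)
```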

edited Feb 14, 2015 at 0:12

answered Feb 14, 2015 at 0:06

Mark Tolonen


1 Comment

user2009388 Over a year ago

Short, concise, working and well explained answer. It works. Thanks a lot!

Steven Kryskalla · Accepted Answer · 2015-02-14 00:50:41Z

0

Your interpretation of the lookup tables in encodings is not correct. The code you've listed:

# iso8859_1.py
    '\x8b'     #  0x8B -> <control>
# cp1252.py
    '\u2039'   #  0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK

Tells you two things:

In iso8859-1, the byte 0x8B corresponds to the unicode character '\x8b', which is just a control character.
In cp1252, the byte 0x8B corresponds to the unicode character '\u2039', which is a piece of punctuation: ‹

This does not tell you how to map the unicode character '\x8b' to bytes in cp1252, which is what you're trying to do.

The root of the problem is that "\x8b" is not a valid iso8859-1 character. Look at the table here:

http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout

8b is undefined, so it just decodes as a control character. After it's decoded and we're in unicode land, what is 0x8b? This is a little tricky to find out, but it's defined in the unicode database here:

008B;<control>;Cc;0;BN;;;;;N;PARTIAL LINE FORWARD;;;;

Now, does CP1252 have this control character, "PARTIAL LINE FORWARD"?

http://en.wikipedia.org/wiki/Windows-1252#Code_page_layout

No, it does not. So you get an error when trying to encode it in CP1252.
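
The difference between the two tables can be seen by decoding the same byte with both codecs. A short sketch using the standard unicodedata module (the variable names are illustrative):

```python
import unicodedata

raw = b'\x8b'
as_latin1 = raw.decode('iso-8859-1')  # the control character U+008B
as_cp1252 = raw.decode('cp1252')      # the punctuation character U+2039

# Category 'Cc' means "control character"
print(hex(ord(as_latin1)), unicodedata.category(as_latin1))
print(hex(ord(as_cp1252)), unicodedata.name(as_cp1252))
```

The same byte therefore becomes two different Unicode characters depending on the codec, and only one of them exists in cp1252.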

Unfortunately there's no good solution for this. Some ideas:

1. Guess what encoding the page actually uses. It's probably CP1252, not ISO-8859-1, but who knows. It could even contain a mix of encodings, or incorrectly encoded data (mojibake). You can use chardet to guess the encoding, or force this URL to use CP1252 in your program (overriding what the meta tag says), or you could try a series of codecs and take the first one that decodes & encodes successfully.
2. Fix up the input text or the decoded unicode string using some kind of mapping of problematic characters like this. This will work most of the time, but will fail silently or do something weird if you "fix up" data where the mapping doesn't make sense.
3. Do not try to convert from ISO-8859-1 to CP1252, as they aren't compatible with each other. Writing the output as UTF-8 might work better.
4. Use an encoding error handler. See this table for a list of handlers. xmlcharrefreplace and backslashreplace preserve the information (but require extra steps when decoding), while replace substitutes a placeholder character and ignore silently drops the bad character.
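
As a rough illustration of the error-handler idea, this sketch encodes the problematic character to cp1252 with each of the four handlers mentioned above. The handler names are standard Python error handlers; the `results` dictionary is only for illustration:

```python
text = '\x8b'  # U+008B, which cp1252 cannot encode

# Encode the same character with each error handler and collect the bytes
results = {
    handler: text.encode('cp1252', errors=handler)
    for handler in ('backslashreplace', 'xmlcharrefreplace', 'replace', 'ignore')
}

for handler, encoded in results.items():
    print(handler, encoded)
```

The first two handlers keep enough information to reconstruct the character later, while the last two lose it. The same handler names can also be passed as the errors argument of open() when writing the CSV file.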

These types of issues caused by older encodings are really hard to solve, and there is no perfect solution. This is the reason why unicode was invented.

answered Feb 14, 2015 at 0:50

Steven Kryskalla


3 Comments

Mark Ransom Over a year ago

If Mojibake is involved the best course of action is probably to retain the exact binary sequence, just as the OP desires. I don't know if the csv module supports that though. Python 3 is rather insistent on converting things to Unicode.

Steven Kryskalla Over a year ago

Yeah that is true. There's also a library that will try to fix mojibake for you: ftfy.readthedocs.org/en/latest

user2009388 Over a year ago

Thanks for the feedback on my interpretation. It's very interesting. 1) It's very likely that it's mojibake, but I want to preserve the website as good as possible - with all the errors. 2) Good idea. Might need that in the future. 3) Not a possible solution, see 1). 4) These handlers still gave me errors.
