Python3: dealing with UTF8 incompatible characters in CSV output

Question

I'm on Python 3.2 and have SQL output that I'm writing to CSV files with a 'Name' identifier and a 'Specifics' field. For some data from China, people's names (and thus Chinese characters) are being inserted. I've done my best to read through the Unicode/decoding docs, but I'm at a loss as to how to replace or remove these characters wholesale, in-line within my Python.

I'm running through the file like so:

import csv, os, os.path
rfile = open('nonbillabletest2.csv','r',newline='')
dataread= csv.reader(rfile)
trash=next(rfile) #ignores the header line in csv

#Process the target CSV by creating an output with a unique filename per CompanyName
for line in dataread:
    [CompanyName,Specifics] = line
    #Check that a target csv does not exist
    if os.path.exists('test leads '+CompanyName+'.csv') < 1:
        wfile= open('test leads '+CompanyName+'.csv','a')
        datawrite= csv.writer(wfile, lineterminator='\n')
        datawrite.writerow(['CompanyName','Specifics']) #write new header row in each file created
        datawrite.writerow([CompanyName,Specifics])
wfile.close()    
rfile.close()

I receive this error:

Traceback (most recent call last):
  File "C:\Users\Matt\Dropbox\nonbillable\nonbillabletest.py", line 26, in <module>
    for line in dataread:
  File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1886: character maps to <undefined>

Examining the file contents shows that some bytes are clearly not valid UTF-8:

print(repr(open('nonbillabletest2.csv', 'rb').read()))

b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x86\xac\xff; \r\neGENTIC,
\x86\xac\xff; \r\neGENTIC,\x91\x9d?; \r\neGENTIC,\x86\xac\xff; \r\n'

Incorporating an encoding='utf8' argument does not resolve the issue. I have been able to remove individual characters with ...replace('\x86\xac\xff', '')), but I'd have to do this for every character I could encounter, which isn't efficient.

If there's a SQL solution that would be fine, too. Please help!

Update: I've removed the characters using string.printable, as was suggested. I then hit one more error because there was always one final empty row in 'contents'. Adding a check that skips zero-length rows took care of that, however.

Thanks a lot for all your fast help!

Why do you say non-UTF8 characters? More exactly, do you know what encoding the .csv file is using?
– Paulo Bu, Commented Jul 12, 2013 at 18:30
Do you know the source of the CSV? It looks to me like those are UTF-16 encoded characters. At least, the pattern '\x86\xac\xff;' decodes to two codepoints for what I think are Chinese characters ('\uac86\u3bff') and the '\x91\x9d' decodes to one such codepoint as well. But the rest of the text clearly isn't UTF-16. Might it have been built by sloppy concatenation using some non-Unicode-aware tool?
– Peter DeGlopper, Commented Jul 12, 2013 at 18:53
'non utf8' is probably incorrect and really reflects my unfamiliarity with this type of problem.
– Matt D, Commented Jul 12, 2013 at 20:12
I think it actually is correct: your file has byte sequences that are meaningless in UTF-8 intermixed in it. If you're generating the CSV using something you can control, this could be a fairly easy fix. Also, for a language-independent breakdown of how this stuff works, I highly recommend this essay: joelonsoftware.com/articles/Unicode.html. It doesn't specifically cover Python's Unicode support, but I've never seen a clearer explanation of Unicode and the difference between Unicode and encodings.
– Peter DeGlopper, Commented Jul 12, 2013 at 20:17
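The observations in the comments above can be checked directly on the byte pattern from the question. This is only an illustrative sketch: the helper name can_decode is made up for this example, and treating the bytes as little-endian UTF-16 is an assumption that merely reproduces the codepoints quoted in the comment. It does not establish what the original encoding really was.

```python
# Byte pattern copied from the repr() output in the question.
sample = b'\x86\xac\xff;'


def can_decode(data, encoding):
    """Return True if `data` decodes cleanly with `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# cp1252 leaves a handful of byte values unmapped, 0x90 among them,
# which is the byte the traceback complains about.
cp1252_ok = can_decode(b'\x90', 'cp1252')

# 0x86 cannot start a UTF-8 sequence, so strict UTF-8 decoding fails too.
utf8_ok = can_decode(sample, 'utf-8')

# Read as little-endian UTF-16, the four bytes form two codepoints.
utf16_text = sample.decode('utf-16-le')

print(cp1252_ok, utf8_ok, ascii(utf16_text))
```

Neither cp1252 (the Windows default seen in the traceback) nor UTF-8 accepts these bytes, which is why changing the encoding argument alone does not help.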

sjbrown · Accepted Answer · 2013-07-12 19:25:37Z

1

So nonbillabletest2.csv is not encoded in UTF-8.

You could:

Fix it upstream. Ensure that it comes to you properly encoded as UTF-8, like you expect. This may be the "SQL solution" you refer to.

Remove all the non-ASCII characters beforehand (which, for purists, corrupts the data, but from what you've said, it seems like that's acceptable to you):

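import csv, os, string

# Read the raw bytes so that no decoding is attempted
rfile = open('nonbillabletest2.csv', 'rb')
rbytes = rfile.read()
rfile.close()

# Keep only printable ASCII (string.printable already includes whitespace)
contents = ''
for b in rbytes:
  if chr(b) in string.printable:
    contents += chr(b)

# The text ends with '\r\n', so split() leaves a final '' that
# csv.reader yields as an empty row
dataread = csv.reader(contents.split('\r\n'))
....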
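As a possible alternative to filtering byte by byte, the same "drop what cannot be represented" idea can be sketched by letting open() discard undecodable bytes. The function name read_ascii_rows is hypothetical. Unlike the string.printable filter above, this variant keeps ASCII control characters, so it is a related technique rather than an exact equivalent.

```python
import csv


def read_ascii_rows(path):
    """Return the CSV rows in `path`, silently dropping every non-ASCII byte.

    Hypothetical helper: errors='ignore' discards bytes that the ascii
    codec cannot decode, and blank rows are skipped.
    """
    with open(path, 'r', encoding='ascii', errors='ignore', newline='') as rfile:
        return [row for row in csv.reader(rfile) if row]
```

Calling read_ascii_rows('nonbillabletest2.csv') would then return the header row followed by the data rows, with the undecodable bytes removed from the Specifics column. Passing errors='replace' instead would keep a visible placeholder character where data was lost.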


3 Comments

Matt D Over a year ago

The contents are relatively unimportant in the scheme of things, but with 99% of datapoints they are pulled in correctly. Removing the chars as you suggested is just fine -- glad to learn about string.printable!

Matt D Over a year ago

That said, incorporating your change leaves me with the error below, even though it is correctly generating each row. A bit more help, please :) Traceback (most recent call last): File "C:/Python32/nonbillable/overflowsuggestion1.py", line 19, in <module> [CompanyName,Specifics] = line ValueError: need more than 0 values to unpack

Matt D Over a year ago

This happens even though the unpacking looks correct to me: print(line) returns "['CompanyName', 'Specifics']"...
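The ValueError in the comments above is consistent with an empty row reaching the unpacking line. The sketch below uses a shortened, made-up sample string rather than the real file. It shows where such a row can come from when the text is split on '\r\n', and one way to avoid it.

```python
import csv

# Shortened stand-in for the filtered file contents (assumed sample data).
contents = 'CompanyName,Specifics\r\neGENTIC,; \r\n'

# Because the text ends with a line terminator, split() leaves a trailing ''.
lines = contents.split('\r\n')

# csv.reader turns that empty string into an empty row, and unpacking
# an empty row into two names raises ValueError.
rows = list(csv.reader(lines))

# splitlines() produces no trailing empty string, and the `if row`
# filter guards against blank lines elsewhere in the data.
clean_rows = [row for row in csv.reader(contents.splitlines()) if row]

for CompanyName, Specifics in clean_rows:
    print(CompanyName, repr(Specifics))
```

Skipping zero-length rows inside the loop, as described in the question's update, achieves the same result.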
