I'm on Python 3.2 and have SQL output that I'm writing to CSV files with a 'Name' identifier and a 'specifics' field. Some of the data comes from China, so people's names (and thus Chinese characters) end up in the output. I've done my best to read through the Unicode/decoding docs, but I'm at a loss as to how to convert or remove these characters wholesale, in-line within my Python code.

I'm running through the file like so:

import csv, os, os.path
rfile = open('nonbillabletest2.csv','r',newline='')
dataread= csv.reader(rfile)
trash=next(rfile) #ignores the header line in csv:

#Process the target CSV by creating an output with a unique filename per CompanyName
for line in dataread:
    [CompanyName,Specifics] = line
    #Check that a target csv does not exist
    if os.path.exists('test leads '+CompanyName+'.csv') < 1:
        wfile= open('test leads '+CompanyName+'.csv','a')
        datawrite= csv.writer(wfile, lineterminator='\n')
        datawrite.writerow(['CompanyName','Specifics']) #write new header row in each file created
        datawrite.writerow([CompanyName,Specifics])
wfile.close()    
rfile.close()

I receive this error:

Traceback (most recent call last):
  File "C:\Users\Matt\Dropbox\nonbillable\nonbillabletest.py", line 26, in <module>
    for line in dataread:
  File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1886: character maps to <undefined>

Examining the file contents shows that it clearly contains some non-UTF-8 bytes:

print(repr(open('nonbillabletest2.csv', 'rb').read()))

b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x86\xac\xff; \r\neGENTIC,
\x86\xac\xff; \r\neGENTIC,\x91\x9d?; \r\neGENTIC,\x86\xac\xff; \r\n'

Passing encoding='utf8' to open() does not resolve the issue. I have been able to remove individual characters with ...replace('\x86\xac\xff', ''), but I'd have to do this for every character I could encounter, which isn't efficient.

If there's a SQL solution that would be fine, too. Please help!


Update: I've removed the characters using string.printable, as was suggested. I hit one more error because there was always one final, empty row in 'contents'. Adding an if len(line) == 0 check before unpacking took care of that, however.
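
For anyone who hits the same ValueError, here is a minimal sketch of that empty-row check. The sample string is a made-up stand-in for the cleaned contents, and the rows list is only there for illustration; it is not the exact script I ran.

```python
import csv

# Hypothetical sample standing in for the cleaned file contents. The
# trailing '\r\n' leaves a final empty string after split('\r\n').
contents = 'CompanyName,Specifics\r\neGENTIC,; \r\n'

rows = []
for line in csv.reader(contents.split('\r\n')):
    if len(line) == 0:
        # csv.reader yields an empty list for an empty line, which
        # cannot be unpacked into two names, so skip it.
        continue
    CompanyName, Specifics = line
    rows.append((CompanyName, Specifics))

print(rows)
```

Without the check, the last iteration receives an empty list and the unpacking fails.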

Thanks a lot for all your fast help!

  • Why do you say non-UTF8 characters? More exactly, do you know what encoding is the .csv file using? Commented Jul 12, 2013 at 18:30
  • Do you know the source of the CSV? It looks to me like those are UTF-16 encoded characters. At least, the pattern '\x86\xac\xff;' decodes to two codepoints for what I think are Chinese characters ('\uac86\u3bff') and the '\x91\x9d' decodes to one such codepoint as well. But the rest of the text clearly isn't UTF-16. Might it have been built by sloppy concatenation using some non-Unicode-aware tool? Commented Jul 12, 2013 at 18:53
  • 'non utf8' is probably incorrect and really refers more to my unfamiliarity of this type of problem. Commented Jul 12, 2013 at 20:12
  • I think it actually is correct - your file has byte sequences that are meaningless in UTF-8 intermixed in it. If you're generating the CSV using something you can control, this could be a fairly easy fix. Also, for a language-independent breakdown of how this stuff works, I highly recommend this essay: joelonsoftware.com/articles/Unicode.html - it doesn't specifically cover Python's Unicode support, but I've never seen a clearer explanation of Unicode and of the difference between Unicode and encodings. Commented Jul 12, 2013 at 20:17
  • Thanks, Peter. I'll check it out. Commented Jul 12, 2013 at 20:41
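
The observations in the comments above can be checked directly. The following is a small sketch, not something posted in the thread: try_decode is a hypothetical helper name, and the byte strings are copied from the repr output in the question.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

sample = b'\x86\xac\xff;'          # byte pattern from the question's file dump

as_utf8 = try_decode(sample, 'utf-8')       # 0x86 cannot start a UTF-8 sequence
as_utf16 = try_decode(sample, 'utf-16-le')  # two 16-bit little-endian code units
as_cp1252 = try_decode(b'\x90', 'cp1252')   # the byte named in the traceback

print(as_utf8)
print(ascii(as_utf16))   # ascii() avoids console encoding problems when printing
print(as_cp1252)
```

The bytes are invalid as UTF-8 and 0x90 is unmapped in cp1252, which matches the traceback, while the same four bytes happen to decode as two UTF-16 code units.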

1 Answer

So nonbillabletest2.csv is not encoded in UTF-8.

You could:

  1. Fix it upstream. Ensure that it comes to you properly encoded as UTF-8, like you expect. This may be the "SQL solution" you refer to.
  2. Remove all the non-ASCII characters beforehand (which, for purists, corrupts the data, but from what you've said, it seems like that's acceptable to you):

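    import csv, os, string
    # Read the raw bytes so that no decoding (and no decode error) happens.
    rfile = open('nonbillabletest2.csv', 'rb')
    rbytes = rfile.read()
    rfile.close()
    
    # Keep only printable ASCII; string.printable already includes whitespace.
    # Iterating over bytes in Python 3 yields ints, hence chr(b).
    contents = ''.join(chr(b) for b in rbytes if chr(b) in string.printable)
    
    # Note: the file ends with '\r\n', so split() leaves a final empty string,
    # which csv.reader yields as an empty row. Skip empty rows when unpacking.
    dataread = csv.reader(contents.split('\r\n'))
    ....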
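
As a possible alternative to filtering byte by byte, the built-in open() accepts an errors argument, so the file can be decoded as ASCII with undecodable bytes dropped. The sketch below assumes that dropping those bytes is acceptable; the temporary file and its contents are hypothetical stand-ins for nonbillabletest2.csv so that the example runs on its own.

```python
import csv
import os
import tempfile

# Hypothetical sample file standing in for nonbillabletest2.csv.
raw = b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x91\x9d?; \r\n'
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'wb') as f:
    f.write(raw)

# Decode as ASCII and silently drop every byte that is not valid ASCII.
with open(path, 'r', encoding='ascii', errors='ignore', newline='') as rfile:
    dataread = csv.reader(rfile)
    next(dataread)                            # skip the header row
    rows = [row for row in dataread if row]   # ignore any empty rows

print(rows)
```

Unlike the string.printable filter, this keeps ASCII control characters, so it is not an exact equivalent. It does let csv.reader work on the file object directly, which avoids the manual split on '\r\n'.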
    
3 Comments

The contents are relatively unimportant in the scheme of things, and 99% of the data points are pulled in correctly. Removing the chars as you suggested is just fine -- glad to learn about string.printable!
That said, incorporating your change leaves me with the following error, even though each row is generated correctly. A bit more help, please :) Traceback (most recent call last): File "C:/Python32/nonbillable/overflowsuggestion1.py", line 19, in <module> [CompanyName,Specifics] = line ValueError: need more than 0 values to unpack
This happens even though I think it is coded correctly: print(line) returns "['CompanyName', 'Specifics']"...
