I'm parsing a CSV as follows:

with open(args.csv, 'rU') as csvfile:
    try:
        reader = csv.DictReader(csvfile, dialect=csv.QUOTE_NONE)
        for row in reader:
            ...

where args.csv is the name of my file. One of the rows in my file contains an e with two dots on top. My script breaks when it encounters that row.

I get the following stack trace:

File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)

and the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 5: invalid start byte

FWIW, I'm running Python 2.7 and upgrading isn't an option (for a few reasons).

I'm pretty lost about how to fix this so any help is much appreciated.

Thanks!

  • What if you try with open(args.csv, 'rU', encoding='utf-8') as csvfile: ? Commented Jun 24, 2016 at 17:53
  • Could you add some data from the CSV file, perhaps as a hexdump? The file may not be valid UTF-8 if it was written in a Windows encoding or some other encoding. Commented Jun 24, 2016 at 17:57
  • The dots are called an umlaut Commented Jun 24, 2016 at 20:09
  • The error doesn't come from the code, it comes from the call to json.dumps Commented Jun 24, 2016 at 20:27
  • And you should add Python 2.7 as a tag. Commented Jun 24, 2016 at 20:29

1 Answer

Byte 0x91 is a "smart" opening single quote in the Windows-1252 encoding, so it sounds like that's the encoding your file is using, not UTF-8. So, use open(args.csv, 'rU', encoding='windows-1252'). Note that the built-in open accepts an encoding argument only on Python 3.
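
To see why the byte points at Windows-1252, here is a minimal sketch that decodes the same bytes both ways. The sample bytes are made up for illustration and are not taken from the asker's file. In Python 2, json.dumps decodes byte strings as UTF-8 by default, which is consistent with the error surfacing there rather than in the CSV loop.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: 0x91 and 0x92 are the Windows-1252 "smart" single quotes.
raw = b'\x91quoted\x92'

# Decoding as Windows-1252 succeeds and yields U+2018 and U+2019.
text = raw.decode('windows-1252')

# Decoding the same bytes as UTF-8 fails, because 0x91 cannot start
# a UTF-8 sequence.
utf8_error = None
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    utf8_error = exc
```

The snippet runs unchanged on Python 2.7 and Python 3.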

3 Comments

When I follow your answer, I get: "TypeError: 'encoding' is an invalid keyword argument for this function". Fwiw, I'm running Python 2.7 and (for a few reasons) can't change that.
@bclayman It is preferable that you mention that in your question, even though it is mentioned in the stacktrace.
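
One possible workaround for the TypeError on Python 2.7, sketched under some assumptions: the file really is Windows-1252, and read_rows and _to_text are hypothetical helper names, not part of any library. On Python 2 the csv module works on byte strings, so the sketch parses the raw bytes and decodes every key and value afterwards; this is safe for Windows-1252 because commas, quotes, and newlines are plain ASCII bytes in that encoding. The Python 3 branch is there only so the same code also runs on newer interpreters.

```python
import csv
import io
import sys


def _to_text(value, encoding):
    # Decode byte strings; leave text (and None) untouched.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def read_rows(path, encoding='windows-1252'):
    """Yield each CSV row as a dict of unicode keys and values."""
    if sys.version_info[0] == 2:
        # Python 2: csv expects bytes, so open in binary mode.
        handle = open(path, 'rb')
    else:
        # Python 3: csv expects text, so let io.open do the decoding.
        handle = io.open(path, encoding=encoding, newline='')
    with handle:
        for row in csv.DictReader(handle):
            yield dict(
                (_to_text(key, encoding), _to_text(val, encoding))
                for key, val in row.items()
            )
```

With rows decoded this way, json.dumps(list(read_rows(args.csv))) receives unicode strings and has nothing left to decode as UTF-8.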
Great answer! I managed to convert a file in the Uzbek language to UTF-8 with iconv -t UTF-8 -f Windows-1252 in.xml. Otherwise I would've spent a lot of time guessing what the 0x91 and 0x92 characters mean.
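
For anyone without iconv at hand, a rough Python equivalent of that conversion might look like the sketch below. convert_to_utf8 is a hypothetical helper name, and the source encoding is assumed to be Windows-1252. io.open accepts an encoding argument on both Python 2.7 and Python 3.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding='windows-1252'):
    """Re-encode a text file to UTF-8, keeping line endings as they are."""
    # newline='' disables newline translation on both read and write.
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as src:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
            for line in src:
                dst.write(line)
```

A few byte values (0x81, for example) are unassigned in Python's Windows-1252 codec, so a file containing them raises UnicodeDecodeError instead of being converted silently.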
