
I'm pulling JSON data in from a large data file to convert the contents to CSV format, and I'm getting an error:

Traceback (most recent call last):
  File "python/gamesTXTtoCSV.py", line 99, in <module>
    writer.writerow(foo)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 15: ordinal not in range(128)

After some digging, I found that the string "\u2013" shows up in the JSON data file.

Example (see the value field):

"states":[
      {
         "display":null,
         "name":"choiceText",
         "type":"string",
         "value":"Show me around \u2013 as long as your friends don't chase me away again!"
      },

I've tried various string replacements in the script to get rid of the offending string.

Stuff like this (where i['value'] is the offending field):

i['value'].replace("\\u2013", "--")

Or

i['value'].replace("\\", "") #this one is the last resort

Or even

i['value'].encode("utf8")

But to no avail - I keep getting the error. Any idea what's going on?

Here's the section of code that writes the csv, in case additional context is needed:

################## filling out the csv ################
openfile= open(inFile)
f = open(outFile, 'wt')
writer = csv.writer(f)
writer.writerow(all_cols)

for row in openfile.readlines():
    line = json.loads(row)
    stateCSVrow= []
    states=line['states']
    contexts=line['context']
    contextCSVrow=[]
    k = 0
    for state in state_names:
        for i in states:
            if i['name']==state:
                i['value'].replace("\u2019", "'") ####THE SECTION GIVING ISSUE
                i['value'].replace("\u2013", "--")
                stateCSVrow.append(i['value'])
        if len(stateCSVrow)==k:
            stateCSVrow.append('NA')
        k +=1
    c = 0
    for context in context_names:
        for i in contexts:
            if i['name']==context:
                contextCSVrow.append(i['value'])
        if len(contextCSVrow)==c:
            contextCSVrow.append('NA')
        c +=1
    first=[]
    first.extend([
        line['key'] ,
        line['timestamp'],
        line['actor']['actorType'],
        line['user']['username'],
        line['version'],
        line['action']['name'],
        line['action']['actionType']
          ])

    foo = first + stateCSVrow + contextCSVrow
    writer.writerow(foo)
  • Are you using Python 2 or 3? I'm assuming 2, because in 3 your strings should all be unicode anyway. Commented Feb 2, 2016 at 20:22
  • What about codecs.open(file, mode, encoding) ? Commented Feb 2, 2016 at 21:00
  • Okay, so I solved it. Python was behaving in ways that I didn't expect. When it was doing the evaluation on the string i['value'] it was evaluating the character itself "–" (elongated dash) not the character code, even when the character code was present in the json data. This fixed the issue: i['value'] = i['value'].encode('utf8') Commented Feb 2, 2016 at 23:02
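The encode('utf8') fix in the last comment is specific to Python 2. As a rough sketch only, assuming Python 3, the csv module takes text strings directly as long as the output file is opened with an explicit encoding. The helper name write_rows, the file name, and the sample row are made up for illustration and are not from the original script.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows of text to a UTF-8 encoded CSV file (Python 3)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)


# Hypothetical sample row containing an en dash (U+2013).
sample_rows = [["choiceText", "Show me around \u2013 later"]]
out_path = os.path.join(tempfile.gettempdir(), "games_example.csv")
write_rows(out_path, sample_rows)

# Read the file back with the same encoding to check the round trip.
with open(out_path, encoding="utf-8", newline="") as f:
    round_tripped = list(csv.reader(f))
```

With the encoding set on the file object, no per-field encode call should be needed.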

1 Answer


You're trying to replace the repr of a Unicode escape sequence. Don't do that; replace the character itself:

In [3]: x = 'fnord \u2034'

In [4]: x
Out[4]: 'fnord ‴'

In [5]: x.replace('\u2034', 'hi')
Out[5]: 'fnord hi'

(IPython with 3.5 on Arch Linux)

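The replacement also goes through in Python 2. Note that in a plain (non-unicode) string literal \u2013 is not an escape, so here both the string and the search pattern contain the literal backslash sequence: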

⚘ python2
Python 2.7.11 (default, Dec  6 2015, 15:43:46)
[GCC 5.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "Show me around \u2013 as long as your friends don't chase me away again!"
>>> x
"Show me around \\u2013 as long as your friends don't chase me away again!"
>>> x.replace('\u2013', '--')
"Show me around -- as long as your friends don't chase me away again!"
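To tie this back to the question's data, here is a small sketch (Python 3 syntax, with a shortened, made-up sample record) of what json.loads does with the escape. Once the JSON is parsed, the six characters \u2013 from the file become a single en dash, so a search for the backslash form finds nothing.

```python
import json

# Raw JSON text as it sits in the file: backslash, 'u', four hex digits.
raw = r'{"name": "choiceText", "value": "Show me around \u2013 later"}'
state = json.loads(raw)
value = state["value"]

# After parsing, no backslash sequence is left in the value.
literal_escape_found = "\\u2013" in value
# The real en dash character (U+2013) is there instead.
en_dash_found = "\u2013" in value

# Searching for the character itself is what matches.
cleaned = value.replace("\u2013", "--")
```

That is why replace("\\u2013", "--") in the question never matches anything.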

7 Comments

I'm a little inexperienced with Python. What I'm doing looks the same as what you are doing, as far as I can tell.
Note that you're not replacing anything on your for context in context_names block - you may have some unicode chars there that are causing you trouble.
That may be, but the error is consistently generated from the state_names - I'm still not sure how to fix the issue. What I'm doing looks the same as what you are suggesting.
The error you pasted is generated from writerow. You are writing foo = first + stateCSVrow + contextCSVrow. Do a print first; print stateCSVrow; print contextCSVrow before you write out your row and I can pretty much guarantee that you'll see a \u2013 in the contextCSVrow (unless you're missing a replace somewhere else)
*doh. - .replace is not an in-place operation. You need i['value'] = i['value'].replace(....)
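A quick sketch of that last point, using a made-up dict in place of one entry from states: strings are immutable, so replace() hands back a new string and leaves the original alone unless the result is assigned.

```python
state = {"name": "choiceText", "value": "Show me around \u2013 later"}

# Calling replace() on its own builds a new string and discards it.
state["value"].replace("\u2013", "--")
unchanged = state["value"]

# Assigning the result back is what actually updates the dict entry.
state["value"] = state["value"].replace("\u2013", "--")
updated = state["value"]
```

Here unchanged still holds the en dash, while updated holds the "--" version.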