python backslash replacement failing

Question

I'm reading JSON data from a large data file to convert its contents to CSV format, and I'm getting an error:

Traceback (most recent call last):
  File "python/gamesTXTtoCSV.py", line 99, in <module>
    writer.writerow(foo)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 15: ordinal not in range(128)

After some digging I've found that the string "\u2013" shows up in the JSON data file.

Example (see the value field):

"states":[
      {
         "display":null,
         "name":"choiceText",
         "type":"string",
         "value":"Show me around \u2013 as long as your friends don't chase me away again!"
      },

I've tried various string replacements in the script to get rid of the offending string.

Stuff like (where i['value'] is the offending field):

 i['value'].replace("\\u2013", "--")

Or

i['value'].replace("\\", "") #this one is the last resort

Or even

i['value'].encode("utf8")

But to no avail - I keep getting the error. Any idea what's going on?

Here's the section of code that writes the csv, in case additional context is needed:

################## filling out the csv ################
openfile= open(inFile)
f = open(outFile, 'wt')
writer = csv.writer(f)
writer.writerow(all_cols)

for row in openfile.readlines():
    line = json.loads(row)
    stateCSVrow= []
    states=line['states']
    contexts=line['context']
    contextCSVrow=[]
    k = 0
    for state in state_names:
        for i in states:
            if i['name']==state:
                i['value'].replace("\u2019", "'") ####THE SECTION GIVING ISSUE
                i['value'].replace("\u2013", "--")
                stateCSVrow.append(i['value'])
        if len(stateCSVrow)==k:
            stateCSVrow.append('NA')
        k +=1
    c = 0
    for context in context_names:
        for i in contexts:
            if i['name']==context:
                contextCSVrow.append(i['value'])
        if len(contextCSVrow)==c:
            contextCSVrow.append('NA')
        c +=1
    first=[]
    first.extend([
        line['key'] ,
        line['timestamp'],
        line['actor']['actorType'],
        line['user']['username'],
        line['version'],
        line['action']['name'],
        line['action']['actionType']
          ])

    foo = first + stateCSVrow + contextCSVrow
    writer.writerow(foo)

Are you using Python 2 or 3? I'm assuming 2, because in 3 your strings should all be unicode anyway.
– Wayne Werner, Commented Feb 2, 2016 at 20:22
Okay, so I solved it. Python was behaving in a way I didn't expect. When it evaluated the string i['value'], it saw the character itself, "–" (an en dash), not the escape code, even though the escape code is what appears in the JSON data. This fixed the issue: i['value'] = i['value'].encode('utf8')
– JoeM05, Commented Feb 2, 2016 at 23:02
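For context on why the encode call helps: in Python 2 the csv module works on byte strings, so a unicode value containing an en dash goes through the ASCII codec when it is written, and that raises the error shown in the question. On Python 3 the file object does the encoding instead. The following is a minimal sketch of the Python 3 route, not the script from the question; the row contents and the temporary output path are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical row containing the en dash (U+2013) from the question's data.
row = ["choiceText", u"Show me around \u2013 as long as"]

# Hypothetical output path; the question's script uses its own outFile.
out_path = os.path.join(tempfile.mkdtemp(), "states.csv")

# Python 3: the file encodes text on write, so the row can hold str values as-is.
# newline="" lets the csv module control line endings itself.
with open(out_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(row)

# Read the bytes back to see how the en dash was stored.
with open(out_path, "rb") as f:
    raw_bytes = f.read()
```

With this approach the en dash ends up in the file as its three-byte UTF-8 encoding, and no replacement of the character is needed unless plain ASCII output is actually required.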

Wayne Werner · Accepted Answer · 2016-02-02 20:21:33Z


You're trying to replace the repr of a unicode escape sequence; don't do that. Replace the character itself:

In [3]: x = 'fnord \u2034'

In [4]: x
Out[4]: 'fnord ‴'

In [5]: x.replace('\u2034', 'hi')
Out[5]: 'fnord hi'

(IPython with Python 3.5 on Arch Linux)

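It works the same in Python 2 (here, in a plain str literal, \u2013 is not an escape, so the string holds a literal backslash and replace matches that text):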

⚘ python2
Python 2.7.11 (default, Dec  6 2015, 15:43:46)
[GCC 5.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "Show me around \u2013 as long as your friends don't chase me away again!"
>>> x
"Show me around \\u2013 as long as your friends don't chase me away again!"
>>> x.replace('\u2013', '--')
"Show me around -- as long as your friends don't chase me away again!"
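The sessions above use string literals, while the question's values come out of json.loads. As a small sketch (the JSON snippet is a shortened, hypothetical version of the question's record), this shows what the decoded value contains: the escape in the file becomes a single en dash character, so searching for the backslash form finds nothing.

```python
import json

# Raw JSON text as it appears in the data file: a backslash, 'u', and four hex digits.
raw = '{"value": "Show me around \\u2013 as long as"}'

value = json.loads(raw)["value"]

# After decoding, the escape has become one character (EN DASH, U+2013).
has_backslash = "\\" in value            # no backslash survives decoding
escaped_hits = value.count("\\u2013")    # occurrences of the six-character text
real_hits = value.count(u"\u2013")       # occurrences of the actual en dash
```

This is why replace("\\u2013", "--") in the question has nothing to match once the data has been parsed.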


7 Comments

JoeM05 Over a year ago

I'm a little inexperienced with python. What I'm doing looks the same as what you are doing, as far as I can tell.

Wayne Werner Over a year ago

Note that you're not replacing anything in your for context in context_names block; you may have some unicode characters there that are causing you trouble.

JoeM05 Over a year ago

That may be, but the error is consistently generated from the state_names, and I'm still not sure how to fix the issue. What I'm doing looks the same as what you are suggesting.

Wayne Werner Over a year ago

The error you pasted is generated from writerow. You are writing foo = first + stateCSVrow + contextCSVrow. Do a print first; print stateCSVrow; print contextCSVrow before you write out your row, and I can pretty much guarantee that you'll see a \u2013 in contextCSVrow (unless you're missing a replace somewhere else).

Wayne Werner Over a year ago

D'oh: .replace is not an in-place operation. Strings are immutable, so it returns a new string. You need i['value'] = i['value'].replace(...)
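A minimal sketch of that last point, using a made-up dictionary in place of one entry from line['states']: calling replace without assigning the result leaves the stored value untouched, while assigning it back keeps the change.

```python
# Hypothetical stand-in for one entry of line['states'] from the question.
state = {"name": "choiceText", "value": u"Show me around \u2013 don\u2019t go"}

# replace returns a new string; without assignment the result is discarded.
state["value"].replace(u"\u2013", u"--")
still_has_dash = u"\u2013" in state["value"]

# Assign the result back to keep the change. Calls can be chained.
state["value"] = state["value"].replace(u"\u2019", u"'").replace(u"\u2013", u"--")
```

After the assignment, state["value"] holds only ASCII characters, so writing it with the Python 2 csv module no longer trips the ASCII codec.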
