
I am making an API call and the response has Unicode characters. Writing this response to a file throws the following error:

'ascii' codec can't encode character u'\u2019' in position 22462

I've tried all combinations of decode and encode ('utf-8').

Here is the code:

url = "https://%s?start_time=%s&include=metric_sets,users,organizations,groups" % (api_path, start_epoch)
while url != None and url != "null":
    json_filename = "%s/%s.json" % (inbound_folder, start_epoch)
    try:
        resp = requests.get(url,
                            auth=(api_user, api_pwd),
                            headers={'Content-Type': 'application/json'})
    except requests.exceptions.RequestException as e:
        print "|********************************************************|"
        print e
        return "Error: {}".format(e)
        print "|********************************************************|"
        sys.exit(1)

    try:
        total_records_extracted = total_records_extracted + rec_cnt
        jsonfh = open(json_filename, 'w')
        inter = resp.text
        string_e = inter  # .decode('utf-8')
        final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')  # .replace('\\ ', ' ')
        encoded_data = final.encode('utf-8')
        cleaned_data = json.loads(encoded_data)
        json.dump(cleaned_data, jsonfh, indent=None)
        jsonfh.close()
    except ValueError as e:
        tb = traceback.format_exc()
        print tb
        print "|********************************************************|"
        print e
        print "|********************************************************|"
        sys.exit(1)

A lot of developers have faced this issue, and a lot of places suggest using .decode('utf-8') or putting # -*- coding: utf-8 -*- at the top of the Python file.

It is still not helping.

Can someone help me with this issue?

Here is the trace:

Traceback (most recent call last):
  File "/Users/SM/PycharmProjects/zendesk/zendesk_tickets_api.py", line 102, in main
    cleaned_data = json.loads(encoded_data)
  File "/Users/SM/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 2826494 (char 2826493)

|********************************************************|
Invalid \escape: line 1 column 2826494 (char 2826493)
  • Just to be sure, on which line does the interpreter stop? Commented Jun 15, 2016 at 23:21
  • Please provide the full stack trace. Commented Jun 15, 2016 at 23:21
  • Are you using requests? If so, just use resp.json() if you want JSON. Commented Jun 15, 2016 at 23:22
  • Please add the traceback to the question to preserve the formatting. It's unreadable in the comment. Commented Jun 15, 2016 at 23:54
  • How is inter initialized? The error message implies it is a Unicode string and you are using Python 2. On Python 2 that Unicode string is implicitly encoded to a byte string with the default ascii codec, because .decode() only works on byte strings. Please provide an MCVE. Commented Jun 16, 2016 at 16:24

1 Answer

inter = resp.text
string_e = inter#.decode('utf-8')
encoded_data = final.encode('utf-8')

The text property is a Unicode character string, decoded from the original bytes using whatever encoding the Requests module guessed might be in use from the HTTP headers.

You probably don't want that; JSON has its own ideas about what the encoding should be, so you should let the JSON decoder do that by taking the raw response bytes from resp.content and passing them straight to json.loads.

What's more, Requests has a shortcut method to do the same: resp.json().
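For illustration, here is a minimal sketch of that idea; the raw bytes are a made-up stand-in for resp.content (containing the very U+2019 character from the error message), and the point is that no manual .decode()/.encode() round-trip is needed:

```python
import json

# Stand-in for resp.content: the raw UTF-8 bytes of a response body
# that contains U+2019 (right single quotation mark)
raw = u'{"subject": "customer\u2019s ticket"}'.encode('utf-8')

# Hand the bytes straight to the JSON decoder; it handles the UTF-8
# decoding itself (Python 2, and Python 3.6+, accept bytes here)
data = json.loads(raw)
```

With resp.json() the same thing happens internally, so you never touch an intermediate text string at all.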

final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')

Trying to do this on the JSON-string-literal formatted input is a bad idea: you will miss some valid escapes and incorrectly unescape others. Your actual error has nothing to do with Unicode at all; this replacement is mangling the input. For example, consider the input JSON:

{"message": "Open the file C:\\newfolder\\text.txt"}

after replacement:

{"message": "Open the file C:\ ewfolder\ ext.txt"}

which is clearly not valid JSON.
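That failure can be reproduced directly: run the question's replacements on the encoded form of this example document, and json.loads raises the same kind of Invalid \escape error as in the traceback.

```python
import json

# The example document: in JSON text, \\ encodes one literal backslash
doc = '{"message": "Open the file C:\\\\newfolder\\\\text.txt"}'
assert json.loads(doc)["message"] == "Open the file C:\\newfolder\\text.txt"

# Applying the question's replacements to the *encoded* JSON text eats
# one backslash of each \\ pair, leaving an invalid "\ " escape behind
mangled = doc.replace('\\n', ' ').replace('\\t', ' ')

try:
    json.loads(mangled)
    broke = False
except ValueError:  # json signals the bad escape with a ValueError
    broke = True
```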

Instead of trying to operate on the JSON-encoded string, you should let json decode the input and then filter any strings in the structured output. This may involve a recursive function that walks down into each level of the data looking for strings to filter, e.g.:

def clean(data):
    if isinstance(data, basestring):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for (key, value) in data.items()}
    return data

cleaned_data = clean(resp.json())
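To see the filter in action, here is a self-contained sketch: the same recursive walk written for Python 3 (where basestring is gone, so str is checked instead), run on a hypothetical payload standing in for resp.json():

```python
# Python 3 variant of the recursive filter above (str replaces basestring)
def clean(data):
    if isinstance(data, str):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for key, value in data.items()}
    return data

# Hypothetical decoded payload in place of resp.json()
payload = {"subject": "line one\nline two", "tags": ["a\tb", {"note": "x\ry"}]}
cleaned = clean(payload)
```

The replacements now run on real decoded strings, so JSON escape sequences are never touched.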

2 Comments

This helps with cleaning the escape characters. However, I also want to write the Unicode data to downstream databases. When I try reading the files written by this code, it encounters Unicode characters such as XTREME on June 23\u201325
json.loads will convert the \u2013 input into the character U+2013 En Dash just fine.
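That comment can be checked in a couple of lines; the second step is an aside on writing the result back out, where json.dumps with ensure_ascii=False keeps the real character instead of re-escaping it to \u2013:

```python
import json

# json.loads turns the \u2013 escape into the actual En Dash character
s = json.loads('"XTREME on June 23\\u201325"')

# ensure_ascii=False emits the character itself; the default
# (ensure_ascii=True) would write the \u2013 escape again
utf8_out = json.dumps(s, ensure_ascii=False).encode("utf-8")
```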
