
I am making an API call and the response has Unicode characters. Writing this response to a file throws the following error:

'ascii' codec can't encode character u'\u2019' in position 22462

I've tried all combinations of decode and encode ('utf-8').

Here is the code:

url = "https://%s?start_time=%s&include=metric_sets,users,organizations,groups" % (api_path, start_epoch)
while url != None and url != "null":
    json_filename = "%s/%s.json" % (inbound_folder, start_epoch)
    try:
        resp = requests.get(url,
                            auth=(api_user, api_pwd),
                            headers={'Content-Type': 'application/json'})
    except requests.exceptions.RequestException as e:
        print "|********************************************************|"
        print e
        return "Error: {}".format(e)
        print "|********************************************************|"
        sys.exit(1)

    try:
        total_records_extracted = total_records_extracted + rec_cnt
        jsonfh = open(json_filename, 'w')
        inter = resp.text
        string_e = inter  # .decode('utf-8')
        final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')  # .replace('\\ ', ' ')
        encoded_data = final.encode('utf-8')
        cleaned_data = json.loads(encoded_data)
        json.dump(cleaned_data, jsonfh, indent=None)
        jsonfh.close()
    except ValueError as e:
        tb = traceback.format_exc()
        print tb
        print "|********************************************************|"
        print e
        print "|********************************************************|"
        sys.exit(1)

A lot of developers have faced this issue, and a lot of places suggest using .decode('utf-8') or putting # -*- coding: utf-8 -*- at the top of the Python file.

It is still not helping.

Can someone help me with this issue?

Here is the trace:

Traceback (most recent call last):
  File "/Users/SM/PycharmProjects/zendesk/zendesk_tickets_api.py", line 102, in main
    cleaned_data = json.loads(encoded_data)
  File "/Users/SM/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 2826494 (char 2826493)

|********************************************************|
Invalid \escape: line 1 column 2826494 (char 2826493)
  • Just to be sure, on which line does the interpreter stop? Commented Jun 15, 2016 at 23:21
  • Please provide the full stack trace. Commented Jun 15, 2016 at 23:21
  • Are you using requests? If so, just use resp.json() if you want JSON. Commented Jun 15, 2016 at 23:22
  • Please add the traceback to the question to preserve the formatting. It's unreadable in the comment. Commented Jun 15, 2016 at 23:54
  • How is inter initialized? The error message implies it is a Unicode string and you are using Python 2. On Python 2 that Unicode string is implicitly encoded to a byte string with the default ascii codec, because .decode() only works on byte strings. Please provide an MCVE. Commented Jun 16, 2016 at 16:24

1 Answer

inter = resp.text
string_e = inter#.decode('utf-8')
encoded_data = final.encode('utf-8')

The text property is a Unicode character string, decoded from the original bytes using whatever encoding the Requests module guessed might be in use from the HTTP headers.

You probably don't want that; JSON has its own ideas about what the encoding should be, so you should let the JSON decoder do that by taking the raw response bytes from resp.content and passing them straight to json.loads.

What's more, Requests has a shortcut method to do the same: resp.json().
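For illustration, here is a minimal sketch of that idea; the raw bytes are a made-up stand-in for resp.content (containing the very U+2019 character from the error message), and the point is that no manual .decode()/.encode() round-trip is needed:

```python
import json

# Stand-in for resp.content: the raw UTF-8 bytes of a response body
# that contains U+2019 (right single quotation mark)
raw = u'{"subject": "customer\u2019s ticket"}'.encode('utf-8')

# Hand the bytes straight to the JSON decoder; it handles the UTF-8
# decoding itself (Python 2, and Python 3.6+, accept bytes here)
data = json.loads(raw)
```

With resp.json() the same thing happens internally, so you never touch an intermediate text string at all.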

final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')

Trying to do this on the JSON-string-literal formatted input is a bad idea: you will miss some valid escapes and incorrectly unescape others. Your actual error has nothing to do with Unicode at all; this replacement is mangling the input. For example, consider the input JSON:

{"message": "Open the file C:\\newfolder\\text.txt"}

after replacement:

{"message": "Open the file C:\ ewfolder\ ext.txt"}

which is clearly not valid JSON.
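That failure can be reproduced directly: run the question's replacements on the encoded form of this example document, and json.loads raises the same kind of Invalid \escape error as in the traceback.

```python
import json

# The example document: in JSON text, \\ encodes one literal backslash
doc = '{"message": "Open the file C:\\\\newfolder\\\\text.txt"}'
assert json.loads(doc)["message"] == "Open the file C:\\newfolder\\text.txt"

# Applying the question's replacements to the *encoded* JSON text eats
# one backslash of each \\ pair, leaving an invalid "\ " escape behind
mangled = doc.replace('\\n', ' ').replace('\\t', ' ')

try:
    json.loads(mangled)
    broke = False
except ValueError:  # json signals the bad escape with a ValueError
    broke = True
```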

Instead of trying to operate on the JSON-encoded string, you should let json decode the input and then filter any strings in the structured output. This may involve a recursive function that walks down into each level of the data looking for strings to filter, e.g.:

def clean(data):
    if isinstance(data, basestring):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for (key, value) in data.items()}
    return data

cleaned_data = clean(resp.json())
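To see the filter in action, here is a self-contained sketch: the same recursive walk written for Python 3 (where basestring is gone, so str is checked instead), run on a hypothetical payload standing in for resp.json():

```python
# Python 3 variant of the recursive filter above (str replaces basestring)
def clean(data):
    if isinstance(data, str):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for key, value in data.items()}
    return data

# Hypothetical decoded payload in place of resp.json()
payload = {"subject": "line one\nline two", "tags": ["a\tb", {"note": "x\ry"}]}
cleaned = clean(payload)
```

The replacements now run on real decoded strings, so JSON escape sequences are never touched.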

2 Comments

This helps with cleaning the escape characters. However, I also want to write the Unicode data to downstream databases. When I try reading the files written by this code, it encounters Unicode characters such as XTREME on June 23\u201325
json.loads will convert the \u2013 input into the character U+2013 En Dash just fine.
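That comment can be checked in a couple of lines; the second step is an aside on writing the result back out, where json.dumps with ensure_ascii=False keeps the real character instead of re-escaping it to \u2013:

```python
import json

# json.loads turns the \u2013 escape into the actual En Dash character
s = json.loads('"XTREME on June 23\\u201325"')

# ensure_ascii=False emits the character itself; the default
# (ensure_ascii=True) would write the \u2013 escape again
utf8_out = json.dumps(s, ensure_ascii=False).encode("utf-8")
```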
