2

I'm using the Python module requests to get data from some APIs, and they all return JSON data, which is converted to dicts. What I want to do is take some info from these dicts and either convert it all to Python strings, so that I can use the stemming module and string.translate() on it, or convert the whole thing to data that these modules recognise. I can't do this with the UTF-8 data and it's doing my head in. Is there any solution to this at all? Can I iterate through the dict and convert it to ASCII?

The strange thing is that I am comparing ASCII strings to the UTF-8 data in other functions (if ASCII-word is in UTF dict: do something) and it works perfectly. The ASCII value matches the UTF-8 data all the time. I can't get my head around this encoding stuff at all.

2
  • This fairly short slideshow has been extremely valuable to me in understanding unicode, str and how they work in Python 2. Commented Jul 29, 2012 at 0:21
  • @kojiro Very good piece of info, thank you. Commented Jul 29, 2012 at 1:47

2 Answers

3

UTF-8 is an extension of ASCII in that valid 7-bit ASCII text is also valid UTF-8 text, so if all the data is in fact representable in ASCII it doesn't make any difference whether it's ASCII or UTF-8.
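A quick way to see this for yourself is to encode the same text both ways and compare the bytes. This is only a minimal sketch; the sample strings are made up for illustration, and it is written so that it should run on both Python 2 and Python 3.

```python
# Illustrative sample text (hypothetical, not from the question).
ascii_text = u"common phrase"

# Pure-ASCII text produces identical bytes under both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A non-ASCII character is where the two encodings part ways:
# UTF-8 needs two bytes for U+00E9, and ASCII cannot encode it at all.
non_ascii_text = u"caf\u00e9"
utf8_bytes = non_ascii_text.encode("utf-8")
```

This is also why comparing ASCII words against the UTF-8 data works as long as the data happens to contain only ASCII characters.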

If the incoming data is UTF-8 encoded, the best approach is to decode it to unicode objects. For example, if you read in a string from some source and store it in the variable utf8str, you can do utf8str.decode('utf-8'). Then pass this unicode object around and do all your operations on it. Instead of string.translate you can use unicode.translate (assuming you're referring to the string method called "translate" there).
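As a rough sketch of that workflow, assuming the goal of translate is to strip punctuation (the question doesn't say), it could look like the following. The byte string and the punctuation set are invented for illustration. Note that unicode.translate takes a dict keyed by code point rather than the 256-character table used by the byte-string version.

```python
# Hypothetical UTF-8 bytes, standing in for data read from some source.
utf8str = b"na\xc3\xafve caf\xc3\xa9!"

# Decode once at the boundary, then work with the unicode object.
text = utf8str.decode("utf-8")

# Map each code point to delete to None (assumed punctuation set).
remove_punct = dict.fromkeys(map(ord, u"!?,."), None)
cleaned = text.translate(remove_punct)
```

The non-ASCII characters survive untouched; only the listed punctuation is removed.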

If your modules cannot deal with unicode strings, you need to think about how you want to handle that. You have to decide what to do if your input contains characters that can't be represented in ASCII.
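To make the choice concrete, here is a small sketch of the standard error handlers you can pass when encoding to ASCII. The sample string is made up; which behaviour is right depends on your data.

```python
text = u"caf\u00e9"  # illustrative input with one non-ASCII character

# Default ("strict"): refuse to encode and raise an error.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "ignore": silently drop anything that does not fit.
ignored = text.encode("ascii", "ignore")

# "replace": substitute a question mark so the loss stays visible.
replaced = text.encode("ascii", "replace")
```

Strict mode surfaces the problem early, while "ignore" and "replace" trade correctness for convenience.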


2 Comments

I can safely say that my data will not contain any non-ASCII characters 99% of the time. And if it does it will not make any difference because I'm trying to find common phrases so they will just be ignored.
You're perfectly correct in what you are saying, but I have accepted mhermans' answer because it is the easiest solution for my needs. Redoing my code and investigating these other methods is just too much time and effort for what I need it for. Thanks for the input, though; I will be sure to follow up on what you've said for future reference.
0

When you are sure the function does not support Unicode, you can always convert to an ASCII-approximation:

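import unicodedata

# Decompose accented characters (NFKD), then drop whatever ASCII cannot represent.
ascii_string = unicodedata.normalize('NFKD', unicode_string).encode('ascii', 'ignore')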
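'NFKD' is the name of a Unicode normalization form (compatibility decomposition): it splits a character such as é into a plain e followed by a combining accent, so that encoding with 'ignore' drops only the accent. A small sketch of the effect, where to_ascii is just a hypothetical helper name and the inputs are made up:

```python
import unicodedata


def to_ascii(unicode_string):
    """Approximate a unicode string in ASCII (hypothetical helper)."""
    # NFKD splits accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", unicode_string)
    # The combining marks are not ASCII, so "ignore" discards them.
    return decomposed.encode("ascii", "ignore")


accented = to_ascii(u"caf\u00e9")    # accent is stripped, base letter kept
no_fallback = to_ascii(u"the\u20ac")  # euro sign has no decomposition, so it vanishes
```

Characters with no ASCII-compatible decomposition are removed entirely, so the result can be shorter than the input and may collide with an unrelated word.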

8 Comments

This is rarely a good idea, as it will just mask errors. If you have data that is not representable in ASCII, and you are trying to operate on it with a function that only handles ASCII, you have a design problem. Just ignoring characters that don't work won't solve that problem, and will likely lead to corrupted output that will cause headaches down the road.
This might work for me because I will be ignoring any non-common words. Non-ASCII characters being masked/ignored makes little to no difference to this particular program.
Generally I agree with you BenBarn. But in the case of string similarity, for example, some algorithms are simply not implemented for Unicode strings, and this conversion is appropriate because some data loss is acceptable. As the poster mentions stemming, it might be acceptable in his/her case too.
@mhermans Just regarding the 'NFKD' in the parentheses: what is that? Just a random string?
Ignoring out-of-range characters in the input is not a good idea for string similarity either. If your string is "the#" where # is some high-Unicode character, then encoding to ASCII and ignoring errors will read it as "the", which will be counted as a common word, even though the input was actually some uncommon word with a strange character in it.