Processing UTF-8 data in Python

Question

I'm using the Python module requests to get data from some APIs, and they all return JSON data, which is converted to dicts. What I want to do is take some info from these dicts and either convert it all to Python strings, so that I can use the stemming module and string.translate() on it, or convert the whole thing to data that these modules recognise. I can't do this with the UTF-8 data and it's doing my head in. Is there any solution to this at all? Can I iterate through the dict and convert it to ASCII?

The strange thing is that I am comparing ASCII strings to the UTF-8 data in other functions (if ASCII-word is in UTF dict: do something) and it works perfectly. The ASCII value matches the UTF-8 data all the time. I can't get my head around this encoding stuff at all.

This fairly short slideshow has been extremely valuable to me in understanding unicode, str and how they work in Python 2.
– kojiro, Commented Jul 29, 2012 at 0:21

BrenBarn · Answer · 2012-07-28 23:52:06Z

3

UTF-8 is an extension of ASCII in that valid 7-bit ASCII text is also valid UTF-8 text, so if all the data is in fact representable in ASCII it doesn't make any difference whether it's ASCII or UTF-8.
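A quick way to see this, as a sketch with made-up sample values: pure ASCII bytes decode to the same text under either codec, and the two only diverge once a non-ASCII character shows up. This is probably why the ASCII comparisons in the question work all the time.

```python
# Sample values are made up for illustration.
ascii_bytes = b'common phrase'

# Pure ASCII bytes decode to identical text under both codecs.
as_ascii = ascii_bytes.decode('ascii')
as_utf8 = ascii_bytes.decode('utf-8')
same_text = (as_ascii == as_utf8)

# A non-ASCII character (here e-acute) takes two bytes in UTF-8 ...
utf8_bytes = u'caf\xe9'.encode('utf-8')

# ... and those bytes are not valid ASCII, so an ASCII decode fails.
try:
    utf8_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```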

If the data coming in is UTF-8 encoded, the best approach is to decode it to unicode objects. For example, if you read in a string from some source and store it in the variable utf8str, you can do utf8str.decode('utf-8'). Then pass this unicode object around and do all your operations on the unicode object. Instead of string.translate you can use unicode.translate (assuming you're referring to the string method called "translate" there).
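As a rough sketch of that workflow (the sample bytes and the punctuation-stripping table are assumptions for illustration, not something from the question), written so it should run on both Python 2 and Python 3, where the decoded object is simply a str:

```python
import string

# Hypothetical raw bytes, as they might arrive from a file or socket.
utf8str = b'Caf\xc3\xa9, please!'

# Decode once at the boundary and work with the unicode object from here on.
text = utf8str.decode('utf-8')

# Unlike the byte-string version, unicode.translate takes a dict that maps
# code points to replacements; mapping to None deletes the character.
strip_punct = dict((ord(ch), None) for ch in string.punctuation)
cleaned = text.translate(strip_punct)
```

Here cleaned keeps the accented character and only loses the comma and the exclamation mark.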

If your modules cannot deal with unicode strings, you need to think about how you want to handle that. You have to decide what to do if your input contains characters that can't be represented in ASCII.
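To make that decision concrete, here is a small sketch of what the standard codec error handlers do when a unicode string is forced into ASCII. The sample text is made up; which handler (if any) is appropriate depends on your data.

```python
text = u'na\xefve caf\xe9'  # "naive cafe" with i-diaeresis and e-acute

# Default ('strict'): refuse to encode anything that is not ASCII.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore': silently drop the offending characters.
dropped = text.encode('ascii', 'ignore')

# 'replace': keep a visible '?' marker where a character was lost.
marked = text.encode('ascii', 'replace')
```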



2 Comments

adohertyd Over a year ago

I can safely say that my data will not contain any non-ASCII characters 99% of the time. And if it does it will not make any difference because I'm trying to find common phrases so they will just be ignored.

adohertyd Over a year ago

You're perfectly correct in what you are saying, but I have accepted mhermans' answer because for my needs it is the easiest solution. To redo my code and investigate these other methods is just too much time/effort for what I need it for. Thanks for the input though; I will be sure to follow up on what you've said for future reference.

mhermans · Accepted Answer · 2012-07-28 23:51:10Z

0

When you are sure the function does not support Unicode, you can always convert to an ASCII-approximation:

import unicodedata

ascii_string = unicodedata.normalize('NFKD', unicode_string).encode('ascii', 'ignore')
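For context, 'NFKD' is not an arbitrary string: it names one of the Unicode normalization forms (compatibility decomposition), which splits characters such as accented letters into a base character plus combining marks. The sketch below wraps the one-liner in a hypothetical helper, to_ascii (the name and the sample words are mine), to show what survives and what does not.

```python
import unicodedata


def to_ascii(unicode_string):
    """Hypothetical helper: lossy ASCII approximation of a unicode string."""
    decomposed = unicodedata.normalize('NFKD', unicode_string)
    return decomposed.encode('ascii', 'ignore')


# NFKD splits e-acute into 'e' followed by a combining acute accent (U+0301).
decomposed = unicodedata.normalize('NFKD', u'caf\xe9')

accented = to_ascii(u'caf\xe9')    # the combining accent is dropped, 'e' stays
ligature = to_ascii(u'\ufb01sh')   # the "fi" ligature decomposes to 'f' + 'i'
eszett = to_ascii(u'Stra\xdfe')    # sharp s has no decomposition, so it is lost
```

Characters without an ASCII-compatible decomposition simply disappear, which is the data loss discussed in the comments on this answer.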



8 Comments

BrenBarn Over a year ago

This is rarely a good idea, as it will just mask errors. If you have data that is not representable in ASCII, and you are trying to operate on it with a function that only handles ASCII, you have a design problem. Just ignoring characters that don't work won't solve that problem, and will likely lead to corrupted output that will cause headaches down the road.

adohertyd Over a year ago

This might work for me because I will be ignoring any non-common words. Non-ASCII characters being masked/ignored makes little to no difference to this particular program.

mhermans Over a year ago

Generally I agree with you, BrenBarn. But e.g. in the case of string similarity, some algorithms are simply not implemented for Unicode strings, and this conversion is appropriate as some data loss is acceptable. As the poster mentions stemming, it might be acceptable in his/her case too.

adohertyd Over a year ago

@mhermans just regarding the 'NFKD' in the parentheses. What is that? Just a random string?

BrenBarn Over a year ago

Ignoring out-of-range characters in the input is not a good idea for string similarity either. If your string is "the#" where # is some high-Unicode character, then decoding with ASCII and ignoring errors will read it as "the", which will be counted as a common word, even though the input was actually some uncommon word with a strange character in it.
