How to pass Unicode string as argument to urllib.urlencode()

Question

I'm using Microsoft's free translation service to translate some Hindi characters to English. They don't provide an API for Python, but I borrowed code from: tinyurl.com/dxh6thr

I'm trying to use the 'Detect' method as described here: tinyurl.com/bxkt3we

The 'hindi.txt' file is saved in unicode charset.

>>> hindi_string = open('hindi.txt').read()
>>> data = { 'text' : hindi_string }
>>> token = msmt.get_access_token(MY_USERID, MY_TOKEN)
>>> request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Detect?'+urllib.urlencode(data))
>>> request.add_header('Authorization', 'Bearer '+token)
>>> response = urllib2.urlopen(request)
>>> print response.read()
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">en</string>
>>>

The response shows that the Translator detected 'en' instead of 'hi' (for Hindi). When I check the type of hindi_string, it is a byte string ('str'), not 'unicode':

>>> type(hindi_string)
<type 'str'>

For reference, here is content of 'hindi.txt':

हाय, कैसे आप आज कर रहे हैं। मैं अच्छी तरह से, आपको धन्यवाद कर रहा हूँ।

I'm not sure if using string.encode or string.decode applies here. If it does, what do I need to encode/decode from/to? What is the best method to pass a Unicode string as a urllib.urlencode argument? How can I ensure that the actual Hindi characters are passed as the argument?

Thank you.

** Additional Information **

I tried using codecs.open() as suggested, but I get the following error:

>>> hindi_new = codecs.open('hindi.txt', encoding='utf-8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\codecs.py", line 671, in read
    return self.reader.read(size)
  File "C:\Python27\lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Here is the repr(hindi_string) output:

>>> repr(hindi_string)
"'\\xff\\xfe9\\t>\\t/\\t,\\x00 \\x00\\x15\\tH\\t8\\tG\\t \\x00\\x06\\t*\\t \\x00
\\x06\\t\\x1c\\t \\x00\\x15\\t0\\t \\x000\\t9\\tG\\t \\x009\\tH\\t\\x02\\td\\t \
\x00.\\tH\\t\\x02\\t \\x00\\x05\\t'"

In which encoding did you save the file? Did you try to use codecs.open instead of plain open to get the file content with the correct encoding?
– Bakuriu, Commented Nov 2, 2012 at 20:48
You show hindi_string defined but not hindi. Please show repr(hindi).
– Eryk Sun, Commented Nov 2, 2012 at 20:54
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
– Katriel, Commented Nov 2, 2012 at 21:04
Also I highly recommend the requests library for doing any HTTP stuff.
– Katriel, Commented Nov 2, 2012 at 21:08
@Bakuriu I tried codecs.open() as suggested, but I got the error (updated above)
– Logic Al, Commented Nov 2, 2012 at 21:10

mata · Accepted Answer · 2012-11-02 21:18:40Z

2

Your file is UTF-16 (the repr output starts with the byte-order mark '\xff\xfe'), so you need to decode the content and re-encode it as UTF-8 before sending it:

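# Read the raw bytes ('rb' avoids text-mode translation on Windows), then decode UTF-16.
hindi_string = open('hindi.txt', 'rb').read().decode('utf-16')
# urllib.urlencode() works on byte strings, so pass the text encoded as UTF-8.
data = { 'text' : hindi_string.encode('utf-8') }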
...
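
As a rough, self-contained illustration of the decode/encode step above, the sketch below builds the query string from UTF-16 bytes held in memory rather than from 'hindi.txt'. The helper name build_detect_query and the three-character sample are assumptions made for the example, and the import fallback is only there so the same lines run on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs

try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3


def build_detect_query(raw_bytes):
    """Hypothetical helper: UTF-16 bytes in, percent-encoded query string out."""
    # The 'utf-16' codec reads the byte-order mark and strips it.
    text = raw_bytes.decode('utf-16')
    # Percent-encode the UTF-8 bytes of the text.
    return urlencode({'text': text.encode('utf-8')})


# Stand-in for the file content: BOM + the first three characters of the sample.
sample = u'\u0939\u093e\u092f'
raw = codecs.BOM_UTF16_LE + sample.encode('utf-16-le')

query = build_detect_query(raw)
print(query)  # text=%E0%A4%B9%E0%A4%BE%E0%A4%AF
```

The resulting string can be appended to the Detect URL in place of urllib.urlencode(data) from the question.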

answered Nov 2, 2012 at 21:18

mata


Nathan Villaescusa · Answer · 2012-11-02 20:51:03Z

0

You could try opening the file using codecs.open and decode it with utf-8:

import codecs

with codecs.open('hindi.txt', encoding='utf-8') as f:
    hindi_text = f.read()
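
Since the file in the question begins with the bytes '\xff\xfe', the 'utf-8' codec rejects it, as the traceback in the question shows. A variant of the same approach that names 'utf-16' instead might look like the sketch below. It uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3, and it writes a temporary file as a stand-in for 'hindi.txt'; the path and the sample text are assumptions made for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Stand-in for 'hindi.txt': a small UTF-16 file written to a temp directory.
sample = u'\u0939\u093e\u092f'
path = os.path.join(tempfile.mkdtemp(), 'hindi.txt')
with io.open(path, 'w', encoding='utf-16') as f:
    f.write(sample)  # the codec writes a byte-order mark first

# Reading with the matching codec yields a unicode string with the BOM removed.
with io.open(path, encoding='utf-16') as f:
    hindi_text = f.read()

# Encode to UTF-8 before handing the text to urlencode.
utf8_bytes = hindi_text.encode('utf-8')
```

Here hindi_text is a unicode string, so it still has to be encoded to UTF-8 bytes before being passed to urllib.urlencode() on Python 2.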

edited Nov 2, 2012 at 20:51

answered Nov 2, 2012 at 20:42

Nathan Villaescusa
