I'm using Microsoft's free translation service to translate some Hindi characters to English. They don't provide an API for Python, but I borrowed code from: tinyurl.com/dxh6thr

I'm trying to use the 'Detect' method as described here: tinyurl.com/bxkt3we

The 'hindi.txt' file is saved in unicode charset.

>>> hindi_string = open('hindi.txt').read()
>>> data = { 'text' : hindi_string }
>>> token = msmt.get_access_token(MY_USERID, MY_TOKEN)
>>> request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Detect?'+urllib.urlencode(data))
>>> request.add_header('Authorization', 'Bearer '+token)
>>> response = urllib2.urlopen(request)
>>> print response.read()
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">en</string>
>>>

The response shows that the Translator detected 'en' instead of 'hi' (for Hindi). When I check the type of the string, it is a plain byte string ('str'), not 'unicode':

>>> type(hindi_string)
<type 'str'>

For reference, here is content of 'hindi.txt':

हाय, कैसे आप आज कर रहे हैं। मैं अच्छी तरह से, आपको धन्यवाद कर रहा हूँ।

I'm not sure if using string.encode or string.decode applies here. If it does, what do I need to encode/decode from/to? What is the best method to pass a Unicode string as a urllib.urlencode argument? How can I ensure that the actual Hindi characters are passed as the argument?

Thank you.

** Additional Information **

I tried using codecs.open() as suggested, but I get the following error:

>>> hindi_new = codecs.open('hindi.txt', encoding='utf-8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\codecs.py", line 671, in read
    return self.reader.read(size)
  File "C:\Python27\lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Here is the repr(hindi_string) output:

>>> repr(hindi_string)
"'\\xff\\xfe9\\t>\\t/\\t,\\x00 \\x00\\x15\\tH\\t8\\tG\\t \\x00\\x06\\t*\\t \\x00
\\x06\\t\\x1c\\t \\x00\\x15\\t0\\t \\x000\\t9\\tG\\t \\x009\\tH\\t\\x02\\td\\t \
\x00.\\tH\\t\\x02\\t \\x00\\x05\\t'"
  • In which encoding did you save the file? Did you try to use codecs.open instead of plain open to get the file content with the correct encoding? Commented Nov 2, 2012 at 20:48
  • You show hindi_string defined but not hindi. Please show repr(hindi). Commented Nov 2, 2012 at 20:54
  • Also I highly recommend the requests library for doing any HTTP stuff. Commented Nov 2, 2012 at 21:08
  • @Bakuriu I tried codecs.open() as suggested, but I got the error (updated above) Commented Nov 2, 2012 at 21:10

2 Answers

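Your file is UTF-16: the \xff\xfe at the start of the repr output is a UTF-16 little-endian byte-order mark. Read the raw bytes, decode them from UTF-16, then encode the result as UTF-8 before URL-encoding it:

# Open in binary mode so Windows does not alter the raw bytes.
hindi_string = open('hindi.txt', 'rb').read().decode('utf-16')
# urlencode expects a byte string, so encode the unicode text as UTF-8.
data = { 'text' : hindi_string.encode('utf-8') }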
...
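
A fuller sketch of that decode/encode round trip, written so it should run on both Python 2.7 and Python 3. The helper name build_detect_query is made up for illustration; it is not part of msmt or of the Translator API. It takes the raw file bytes and returns the query string that goes after 'Detect?'. The sample bytes are the first eight bytes shown in the question's repr output.

```python
# -*- coding: utf-8 -*-
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3


def build_detect_query(raw_bytes):
    """Hypothetical helper: UTF-16 file bytes -> URL query string."""
    # Decoding with 'utf-16' reads the byte-order mark and drops it.
    text = raw_bytes.decode('utf-16')
    # Percent-encode the UTF-8 bytes of the text, not the UTF-16 ones.
    return urlencode({'text': text.encode('utf-8')})


# First bytes of hindi.txt as shown by repr(hindi_string) in the question.
sample = b'\xff\xfe9\t>\t/\t'
print(build_detect_query(sample))  # text=%E0%A4%B9%E0%A4%BE%E0%A4%AF
```

With the file from the question, build_detect_query(open('hindi.txt', 'rb').read()) would take the place of urllib.urlencode(data) when building the request URL.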

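You could open the file with codecs.open, which decodes the content as it reads. The traceback in the question (byte 0xff at position 0) rules out UTF-8, and the \xff\xfe prefix in the repr output is a UTF-16 byte-order mark, so pass encoding='utf-16':

import codecs

with codecs.open('hindi.txt', encoding='utf-16') as f:
    hindi_text = f.read()

hindi_text is then a unicode object; encode it as UTF-8 before passing it to urllib.urlencode.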
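
If you are not sure how a file was saved, the byte-order mark at the start can tell you which codec to try. The traceback in the question complains about byte 0xff at position 0, and the repr output starts with \xff\xfe, which matches codecs.BOM_UTF16_LE. Below is a minimal sketch of that check; sniff_bom is a hypothetical helper name, and it only recognises files that actually begin with a BOM.

```python
import codecs

# UTF-32 is checked first because its little-endian BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(raw_bytes):
    """Return a codec name based on the leading BOM, or None if absent."""
    for bom, codec_name in _BOMS:
        if raw_bytes.startswith(bom):
            return codec_name
    return None


# First bytes of hindi.txt as shown by repr(hindi_string) in the question.
print(sniff_bom(b'\xff\xfe9\t>\t/\t'))  # utf-16
```

The returned name can be passed straight to bytes.decode or to codecs.open as the encoding argument. A result of None means only that no BOM was found, not that the file is ASCII or UTF-8.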
