3

Hello, I was wondering if you know any other way to encode a string so that it is URL-safe, because urllib.quote seems to be doing it wrong; the output is different from what I expected:

If I try

urllib.quote('á')

I get

'%C3%A1'

But that's not the correct output; it should be %E1

As demonstrated by the tool provided on this site.

And this is not me being difficult: the incorrect output of quote is preventing the browser from finding resources. If I try

urllib.quote('\images\á\some file.jpg')

and then try the same string with the JavaScript tool I mentioned, I get these strings, respectively:

%5Cimages%5C%C3%A1%5Csome%20file.jpg

%5Cimages%5C%E1%5Csome%20file.jpg

Note how they are almost the same, but the URL produced by quote doesn't work and the other one does. I tried calling encode('utf-8') on the string passed to quote, but it makes no difference. I tried other Spanish words with accents and the ñ; they are all represented differently.

Is this a Python bug? Do you know of a module that gets this right?

5
  • 3
    Are both JavaScript and Python using the same encoding? Have you tried unicode? repr('á') == "'\\xc3\\xa1'" and repr(u'á') == "u'\\xe1'" Commented Jun 14, 2011 at 2:34
  • @Rob: I'm pretty sure UTF-8 is supposed to be in URLs. Commented Jun 14, 2011 at 2:36
  • 2
    Related: stackoverflow.com/questions/912811/… Commented Jun 14, 2011 at 2:38
  • 1
    0xc3a1 is a UTF-8 representation of LATIN SMALL LETTER A WITH ACUTE. Commented Jun 14, 2011 at 2:38
  • @sarnold Oh, that helps. Now I know I want my URLs in Unicode, not in UTF-8, but doing unicode(urllib.quote(string)) is not working. Commented Jun 14, 2011 at 3:51

5 Answers

7

According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.

See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
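The two results in the question differ only in which byte encoding is applied before percent-encoding. A minimal sketch of that difference, written for Python 3 (where the function lives in urllib.parse rather than urllib, unlike the Python 2 code in the question); the variable names are only illustrative:

```python
from urllib.parse import quote

# RFC 3986 style: encode the character as UTF-8, then percent-encode each byte.
utf8_bytes = 'á'.encode('utf-8')         # b'\xc3\xa1' (two bytes)
utf8_quoted = quote(utf8_bytes)          # '%C3%A1'

# Legacy style: encode as ISO-8859-1 first, which yields a single byte.
latin1_bytes = 'á'.encode('iso-8859-1')  # b'\xe1' (one byte)
latin1_quoted = quote(latin1_bytes)      # '%E1'

print(utf8_quoted, latin1_quoted)
```

Both strings are valid percent-encodings of bytes; which one a given server resolves depends on the encoding that server assumes for its paths.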


4 Comments

Could be, but none of my browsers find the resource with the "updated" encoding.
So, is there any module that uses the outdated but actually working encoding?
@Guillermo, can you update your server to allow newer HTTP clients to request resources as specified in the newer RFC?
I'm using web.py's internal server, so I can't do anything about that for the time being.
3

OK, got it: I have to encode to ISO-8859-1 like this

word = u'á'
word = word.encode('iso-8859-1')
print word
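The snippet above only produces the ISO-8859-1 bytes; they still need to be percent-encoded. A rough Python 3 sketch of the combined step, assuming the server really expects ISO-8859-1 paths (quote_latin1 is a hypothetical helper name, not part of the standard library):

```python
from urllib.parse import quote

def quote_latin1(text):
    """Percent-encode text using ISO-8859-1 bytes (hypothetical helper)."""
    # errors='strict' raises UnicodeEncodeError for characters that
    # ISO-8859-1 cannot represent, instead of silently mangling them.
    return quote(text, encoding='iso-8859-1', errors='strict')

path = '\\images\\á\\some file.jpg'  # the path from the question
legacy_url = quote_latin1(path)      # '%5Cimages%5C%E1%5Csome%20file.jpg'
utf8_url = quote(path)               # '%5Cimages%5C%C3%A1%5Csome%20file.jpg'

print(legacy_url)
print(utf8_url)
```

In Python 2 the equivalent would be passing the encoded byte string to urllib.quote, for example urllib.quote(word.encode('iso-8859-1')).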


1

Python 2 assumes ASCII source by default, so even though your file may be encoded differently, your UTF-8 character is treated as two separate bytes in a plain string.

Try putting a comment like this on the first or second line of your code to match the file encoding; you might also need to use u'á'.

# coding: utf-8


0

What about using unicode strings and the numeric representation (ord) of the char?

>>> print '%{0:X}'.format(ord(u'á'))
%E1
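This works for á because the first 256 Unicode code points coincide with ISO-8859-1, so the code point and the single ISO-8859-1 byte are both 0xE1. A small Python 3 sketch of where the trick stops working (ord_hack is just an illustrative wrapper around the formatting above):

```python
from urllib.parse import quote

def ord_hack(ch):
    # Same formatting as the snippet above, applied to a single character.
    return '%{0:X}'.format(ord(ch))

a_acute = ord_hack('á')  # '%E1'   (matches the ISO-8859-1 byte)
euro = ord_hack('€')     # '%20AC' (not a valid single-byte escape)
tab = ord_hack('\t')     # '%9'    (missing the leading zero)

# For comparison, quote always emits two hex digits per byte.
tab_quoted = quote('\t', encoding='iso-8859-1')  # '%09'

print(a_acute, euro, tab, tab_quoted)
```

Characters such as € have no ISO-8859-1 byte at all, so a server limited to that encoding cannot address them either way.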

2 Comments

It's a hack, but a hack may be required for a website that still requires ISO-8859-1. Most web servers are now compliant with UTF-8, as assumed by urllib.
Works but looks like black magic, and it doesn't work with more than one character, and looping over all my content just doesn't seem like a good idea.
0

In this question it seems someone wrote a pretty large function to convert to ASCII URLs, which is what I need. But I was hoping there was some encoding tool in the standard library for the job.

1 Comment

I spoke too soon, those functions do not output the Unicode code point that is needed.
