how to url-safe encode a string with python? and urllib.quote is wrong

Question

Hello, I was wondering if there is another way to encode a string so that it is URL-safe, because urllib.quote is doing it wrong. The output is different from what I expected:

If I try

urllib.quote('á')

I get

'%C3%A1'

But that's not the correct output; it should be %E1

As demonstrated by the tool provided on this site.

And this is not me being difficult: the incorrect output of quote prevents the browser from finding resources. If I try

urllib.quote('\images\á\some file.jpg')

and then try the JavaScript tool I mentioned, I get these strings, respectively:

%5Cimages%5C%C3%A1%5Csome%20file.jpg

%5Cimages%5C%E1%5Csome%20file.jpg

Note how they are almost the same, but the URL produced by quote doesn't work and the other one does. I tried calling encode('utf-8') on the string passed to quote, but it makes no difference. I tried other Spanish words with accents and with ñ; they are all represented differently.

Is this a Python bug? Do you know of a module that gets this right?

Are both JavaScript and Python using the same encoding? Have you tried unicode? repr('á') == "'\\xc3\\xa1'" and repr(u'á') == "u'\\xe1'"
– JBernardo, Commented Jun 14, 2011 at 2:34
0xc3a1 is a UTF-8 representation of LATIN SMALL LETTER A WITH ACUTE.
– sarnold, Commented Jun 14, 2011 at 2:38
@sarnold Oh, that helps. Now I know I want my URLs in unicode, not in UTF-8, but doing unicode(urllib.quote(string)) is not working.
– Guillermo Siliceo Trueba, Commented Jun 14, 2011 at 3:51

Community · Accepted Answer · 2021-10-07 05:51:50Z

7

According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link to is out of date.

See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
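As a rough illustration of the rule described above, here is a minimal sketch. It assumes Python 3, where the question's urllib.quote lives at urllib.parse.quote; the helper percent_encode_utf8 is a hypothetical name, not part of the standard library, and it escapes every byte without any safe set.

```python
from urllib.parse import quote


def percent_encode_utf8(text):
    """Hypothetical helper: UTF-8 encode, then percent-encode every byte."""
    return "".join("%{:02X}".format(byte) for byte in text.encode("utf-8"))


# quote() uses UTF-8 unless told otherwise, so both forms agree for 'á'.
utf8_form = quote("á")
manual_form = percent_encode_utf8("á")
print(utf8_form)
print(manual_form)
```

The two bytes 0xC3 0xA1 are the UTF-8 encoding of á, which is why the escape has two %XX groups instead of one.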

edited Oct 7, 2021 at 5:51

CommunityBot

answered Jun 14, 2011 at 2:38

Anomie


4 Comments

Guillermo Siliceo Trueba Over a year ago

Could be, but none of my browsers find the resource with the "updated" encoding.

Guillermo Siliceo Trueba Over a year ago

So, any module that uses the outdated but actually working encoding?

sarnold Over a year ago

@Guillermo, can you update your server to allow newer HTTP clients to request resources as specified in the newer RFC?

Guillermo Siliceo Trueba Over a year ago

I'm using web.py's internal server, so I can't do anything about that for the time being.

Guillermo Siliceo Trueba · Accepted Answer · 2011-06-14 04:40:36Z

3

OK, got it. I have to encode to ISO-8859-1, like this:

# Python 2: turn the unicode string into ISO-8859-1 bytes
word = u'á'
word = word.encode('iso-8859-1')
print word
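For readers on Python 3, a comparable result can probably be had without a manual encode step, since urllib.parse.quote accepts an encoding argument. The sketch below assumes Python 3 and reuses the path from the question; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# The Windows-style path from the question (backslashes escaped for Python 3).
path = "\\images\\á\\some file.jpg"

# Legacy form expected by a server that decodes URLs as ISO-8859-1.
latin1_url = quote(path, encoding="latin-1")

# Default form (UTF-8), as described in RFC 3986.
utf8_url = quote(path)

# Decoding must use the same encoding that was used for quoting.
round_trip = unquote(latin1_url, encoding="latin-1")

print(latin1_url)
print(utf8_url)
```

This only works while every character in the path exists in ISO-8859-1; anything outside that range raises UnicodeEncodeError under the default strict error handling.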

answered Jun 14, 2011 at 4:40

Guillermo Siliceo Trueba


BudgieInWA · Accepted Answer · 2011-06-14 02:40:17Z

1

Python 2 assumes ASCII source code by default, so even though your file may be encoded differently, your UTF-8 character is interpreted as two separate bytes rather than one character.

Try putting a comment like this on the first or second line of your code to match the file encoding. You might also need to use u'á'.

# coding: utf-8
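The difference between one character and its bytes can be seen in a small sketch. It assumes Python 3, where source files are UTF-8 by default and string literals are already unicode, so neither the coding comment nor the u prefix is needed there.

```python
text = "á"  # one character, code point U+00E1

utf8_bytes = text.encode("utf-8")      # two bytes: C3 A1
latin1_bytes = text.encode("latin-1")  # one byte: E1

print(len(text), len(utf8_bytes), len(latin1_bytes))
print(hex(ord(text)))
```

The Latin-1 byte value happens to equal the code point, which is where the %E1 form in the question comes from.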

answered Jun 14, 2011 at 2:40

BudgieInWA


user780363 · Accepted Answer · 2011-06-14 02:42:35Z

0

What about using unicode strings and the numeric representation (ord) of the char?

>>> print '%{0:X}'.format(ord(u'á'))
%E1
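To show where this trick stops working, here is a small sketch assuming Python 3. The function name ord_percent is hypothetical and simply wraps the one-liner above; urllib.parse.quote with encoding="latin-1" is used for comparison.

```python
from urllib.parse import quote


def ord_percent(ch):
    """The one-character trick: '%' followed by the code point in hex."""
    return "%{0:X}".format(ord(ch))


single = ord_percent("á")    # matches Latin-1 only because U+00E1 < 0x100
euro = ord_percent("€")      # four hex digits, not a valid %XX escape
newline = ord_percent("\n")  # a single hex digit, also malformed

# quote() handles whole strings and always emits two hex digits per byte.
whole = quote("año nuevo", encoding="latin-1")

# Characters outside ISO-8859-1 cannot be encoded this way at all.
try:
    quote("€", encoding="latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(single, euro, newline, whole, euro_fits_latin1)
```

So the ord approach agrees with ISO-8859-1 percent-encoding only for single characters in the range U+0010 to U+00FF.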

answered Jun 14, 2011 at 2:42

user780363

2 Comments

krubo Over a year ago

It's a hack, but a hack may be required for a website that still requires ISO-8859-1. Most webservers are now compliant with UTF-8, as assumed by urllib.

Guillermo Siliceo Trueba Over a year ago

It works but looks like black magic, and it doesn't work with more than one character. Looping over all my content just doesn't seem like a good idea.

Community · Accepted Answer · 2017-05-23 12:26:42Z

0

In this question, it seems someone wrote a pretty large function to convert to ASCII URLs, which is what I need. But I was hoping there was some encoding tool in the standard library for the job.

edited May 23, 2017 at 12:26

CommunityBot

answered Jun 14, 2011 at 3:57

Guillermo Siliceo Trueba

1 Comment

Guillermo Siliceo Trueba Over a year ago

I spoke too soon, those functions do not output the Unicode code point that is needed.
