Python: about URL encoding and decoding

Question

I have a problem. I'm trying to use the urllib library in Python, but I don't understand it.

a = 'http%3A%2F%2Ffile%2Efir%2Enet%2F40d55cecf9a3a47851b1d0ebda3e423993c837d3ca%2F20110909%5F52%5Fblogfile%2Folsscj25%5F1315512137967%5F5tAuGI%5Fzip%2F%255B%25C0%25A9%25B5%25B5%25BF%25ECxp%255D%2B%25C0%25A9%25B5%25B5%25BF%25ECxp%2B%25BD%25C3%25B8%25AE%25BE%25F3%25B3%25D1%25B9%25F6%5F%2Ezip'

aa = unquote(unquote(a))
'http://file.fir.net/40d55cecf9a3a47851b1d0ebda3e423993c837d3ca/20110909_52_blogfile/olsscj25_1315512137967_5tAuGI_zip/[\xc0\xa9\xb5\xb5\xbf\xecxp]+\xc0\xa9\xb5\xb5\xbf\xecxp+\xbd\xc3\xb8\xae\xbe\xf3\xb3\xd1\xb9\xf6_.zip'

a1 = quote(quote(aa))
'http%253A//file.fir.net/40d55cecf9a3a47851b1d0ebda3e423993c837d3ca/20110909_52_blogfile/olsscj25_1315512137967_5tAuGI_zip/%255B%25C0%25A9%25B5%25B5%25BF%25ECxp%255D%252B%25C0%25A9%25B5%25B5%25BF%25ECxp%252B%25BD%25C3%25B8%25AE%25BE%25F3%25B3%25D1%25B9%25F6_.zip'

Why are the two values (a and a1) not equal? Please let me know.

Thanks.

Y.H Wong · Accepted Answer · 2012-04-09 09:40:29Z

I think you are conflating multiple problems into one.

First of all, the only reason you are asking this question is because you want to unquote the tail portion of the file name, which seems to be quoted twice.

Second of all, the file name, even when unquoted twice, is not UTF-8 encoded data, so it is not printable as-is.

Thirdly, you don't seem to understand the URL format.

And finally, you don't understand what quote and unquote are actually doing.

urllib.quote() and urllib.unquote() are intended only for the path_info portion of the URL, which is everything after http://file.fir.net/.

urllib.quote() replaces every character in its string argument that is not "safe" in a URL with its percent-encoding. That is, every character that would cause problems (e.g. :, ~, or a space) is replaced by its bytes in %XX hex form.

Since : is not safe in the path portion of a URL, quote() replaces it with its percent-encoding (%3A).

All this means that you should not pass the entire URL straight into quote() unless you actually want to embed a URL inside the path_info portion of another URL.
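A minimal sketch of this behaviour follows. It assumes Python 3, where these functions live in urllib.parse (the question uses the Python 2 names urllib.quote and urllib.unquote). The URL is a made-up example on the same host, not the one from the question.

```python
from urllib.parse import quote

# Hypothetical URL, used only for illustration.
url = "http://file.fir.net/some dir/file_name.zip"

# Quoting the whole URL also escapes the ':' after the scheme.
once = quote(url)
print(once)    # http%3A//file.fir.net/some%20dir/file_name.zip

# Quoting a second time escapes the '%' signs produced by the first pass.
twice = quote(once)
print(twice)   # http%253A//file.fir.net/some%2520dir/file_name.zip

# Quoting only the path leaves the scheme and host untouched.
prefix, path = "http://file.fir.net", "/some dir/file_name.zip"
path_only = prefix + quote(path)
print(path_only)  # http://file.fir.net/some%20dir/file_name.zip
```

The http%253A prefix in the second result is the same pattern that shows up at the start of a1 in the question.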

The steps to solve your problem are something like this:

1. Fix the file name encoding to use something printable, to help you debug.
2. Call urllib.unquote() once to get back a normal URL.
3. Pass the unquoted URL to urlparse.urlparse() to break it into its components.
4. Call urllib.unquote() on the file name portion.
5. Now that you have the original file name, you can proceed to do whatever you need to do.
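The steps above could be sketched roughly as follows. This assumes Python 3 (urllib.parse instead of the Python 2 urllib and urlparse modules) and uses a shortened, hypothetical stand-in for a. The EUC-KR codec is only a guess based on the byte pattern; the question never states the encoding.

```python
from urllib.parse import unquote, unquote_to_bytes, urlparse

# Shortened, hypothetical stand-in for the value `a` in the question.
a = ("http%3A%2F%2Ffile%2Efir%2Enet%2Fblogfile%2F"
     "%255B%25C0%25A9%25B5%25B5%25BF%25ECxp%255D%5F%2Ezip")

# Unquote once to get back a normal URL (the file name is still quoted).
url = unquote(a)

# Split the URL into its components.
parts = urlparse(url)

# Take the last path segment and unquote it to raw bytes, because the
# bytes are not valid UTF-8.
quoted_name = parts.path.rsplit("/", 1)[-1]
raw_name = unquote_to_bytes(quoted_name)

# Assumption: the bytes are EUC-KR; errors="replace" keeps this from
# raising if that guess is wrong.
name = raw_name.decode("euc-kr", errors="replace")

print(url)
print(parts.netloc, parts.path)
print(raw_name)
print(name)
```

Working with bytes until the encoding is known avoids the replacement characters that unquote() would otherwise insert when it tries to decode the name as UTF-8.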

References:

http://docs.python.org/library/urlparse.html

http://docs.python.org/library/urllib.html

kgr · Accepted Answer · 2012-04-09 09:23:06Z

The answer is in the documentation of the quote() method:

... Letters, digits, and the characters '_.-' are never quoted. ...

a and a1 differ because a probably wasn't quoted using quote(), and therefore more characters were quoted than is required. a1 is still a valid quoted string, but some characters weren't quoted because they don't have to be.
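As a small illustration of that point, assuming Python 3's urllib.parse and a made-up string in the same style as a: two differently quoted strings can decode to the same value, so comparing the quoted forms directly can be misleading.

```python
from urllib.parse import quote, unquote

# Hypothetical string quoted the way `a` is: '.', '_' and '/' are
# escaped even though quote() would leave them alone.
over_quoted = "file%2Efir%2Enet%2Fsome%5Fname%2Ezip"

# Round-tripping through quote() gives the minimal quoted form.
minimal = quote(unquote(over_quoted))

print(minimal)                                   # file.fir.net/some_name.zip
print(over_quoted == minimal)                    # False
print(unquote(over_quoted) == unquote(minimal))  # True
```

If two URLs need to be compared, comparing their unquoted forms is usually more robust than trying to reproduce the exact quoting style of the original.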

2 Comments

user1161599 Over a year ago

Thanks for your answer. How can I make the values equal? My user wants to search for a URL like the value aa, so I have to encode aa to a. Could you help with that?

kgr Over a year ago

First, please edit your post and tell us how you obtained a, i.e. how was it quoted?
