URL decode UTF-8 in Python

Question

In Python 2.7, given a URL like:

example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0

How can I decode it to the expected result, example.com?title=правовая+защита?

I tried url=urllib.unquote(url.encode("utf8")), but it seems to give a wrong result.

In the general case, the tail of a URL is just a cookie. You can't know which local character-set encoding the server uses or even whether the URL encodes a string or something completely different. (Granted, many URLs do encode a human-readable string; and often, you can guess the encoding very easily. But it's not possible in the general case or completely automatically.)
– tripleee, Commented Jan 29, 2018 at 12:45
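
To illustrate the comment's point, here is a minimal sketch that assumes a hypothetical server which percent-encodes text as Windows-1251 rather than UTF-8. The sample word and the choice of cp1251 are assumptions made for illustration only; the same escaped string decodes correctly only when the caller already knows the encoding.

```python
from urllib.parse import quote, unquote

# Hypothetical server that percent-encodes text as Windows-1251 instead of UTF-8.
encoded = quote("право", encoding="cp1251")
print(encoded)                               # %EF%F0%E0%E2%EE

# Decoding with the default (UTF-8, errors="replace") yields U+FFFD
# replacement characters rather than the original text.
print(unquote(encoded))

# Decoding succeeds only when the caller supplies the right encoding.
print(unquote(encoded, encoding="cp1251"))   # право
```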

Keyur Potdar · Accepted Answer · 2019-05-05 16:40:09Z

691

The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode it with urllib.parse.unquote(), which handles decoding from percent-encoded data to UTF-8 bytes and then to text, transparently:

from urllib.parse import unquote

url = unquote(url)

Demo:

>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
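
The two steps that unquote() combines can also be written out separately. This is only a sketch using a shortened version of the URL from the question; urllib.parse.unquote_to_bytes() does the percent-decoding and an explicit decode() turns the bytes into text.

```python
from urllib.parse import unquote, unquote_to_bytes

url = "example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE"

raw = unquote_to_bytes(url)      # step 1: %XX escapes -> raw bytes
text = raw.decode("utf-8")       # step 2: UTF-8 bytes -> str

print(text)                      # example.com?title=право
print(text == unquote(url))      # True
```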

The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you'd have to decode manually:

from urllib import unquote

url = unquote(url).decode('utf8')

edited May 5, 2019 at 16:40

Keyur Potdar


answered May 15, 2013 at 13:19

Martijn Pieters



2 Comments

AlexLordThorsen Over a year ago

So why is the + character left in the string? I thought that %2B was the + character and + literals were removed during decoding?

Martijn Pieters Over a year ago

@Rawrgulmuffins + is a space in x-www-form-urlencoded data; you'd use urllib.parse.parse_qs() to parse that, or use urllib.parse.unquote_plus(). But they should only appear in the query string, not the rest of the URL.
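
A short sketch of the two options mentioned in this comment. The URL is the one from the question with an http:// scheme added (an assumption, so that urlsplit() separates the query string cleanly).

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit

url = ("http://example.com/?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2"
       "%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0")

# Work on the query string only; '+' means space there, not in the path.
query = urlsplit(url).query

print(unquote_plus(query))       # title=правовая защита
print(parse_qs(query))           # {'title': ['правовая защита']}

# A literal plus sign has to be sent as %2B to survive decoding.
print(unquote_plus("a%2Bb+c"))   # a+b c
```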

Karl Knechtel · Accepted Answer · 2023-01-10 00:46:31Z

181

If you are using Python 3, you can use urllib.parse.unquote:

url = """example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0"""

import urllib.parse
urllib.parse.unquote(url)

gives:

'example.com?title=правовая+защита'

edited Jan 10, 2023 at 0:46

Karl Knechtel


answered Sep 8, 2015 at 7:42

Pavan


2 Comments

Clocker Over a year ago

using this and getting a dict instead of query string on python3.8

Karl Knechtel Over a year ago

@Clocker can't reproduce. Make sure to follow the example exactly. If you are having difficulty adapting it to your own needs, ask your own question, making sure to follow the advice in How to Ask and minimal reproducible example.

ivanleoncz · Accepted Answer · 2020-08-10 23:19:58Z

31

You can achieve the expected result with the requests library as well:

import requests

url = "http://www.mywebsite.org/Data%20Set.zip"

print(f"Before: {url}")
print(f"After:  {requests.utils.unquote(url)}")

Output:

$ python3 test_url_unquote.py

Before: http://www.mywebsite.org/Data%20Set.zip
After:  http://www.mywebsite.org/Data Set.zip

This might be handy if you are already using requests and don't want to import another library for this job.

answered Aug 10, 2020 at 23:19

ivanleoncz


3 Comments

lfurini Over a year ago

Works with Python 2 too.

bfontaine Over a year ago

This is just an alias for urllib.parse.unquote.

ivanleoncz Over a year ago

This comment of yours added a lot. Thank you very much.

Roland Puntaier · Accepted Answer · 2021-09-26 10:04:11Z

3

In HTML, URLs can contain HTML entities. This replaces them, too:

# from urllib import unquote  # Python 2
from urllib.parse import unquote  # Python 3
from html import unescape
unescape(unquote('https://v.w.xy/p1/p22?userId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&amp;confirmationToken=7uAf%2fxJoxRTFAZdxslCn2uwVR9vV7cYrlHs%2fl9sU%2frix9f9CnVx8uUT%2bu8y1%2fWCs99INKDnfA2ayhGP1ZD0z%2bodXjK9xL5I4gjKR2xp7p8Sckvb04mddf%2fiG75QYiRevgqdMnvd9N5VZp2ksBc83lDg7%2fgxqIwktteSI9RA3Ux9VIiNxx%2fZLe9dZSHxRq9AA'))
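
A smaller sketch of the same idea, assuming the string was copied out of an HTML attribute. The href value here is made up for illustration. It unescapes the HTML layer first and percent-decodes second, since that is the order in which the two encodings were applied; for inputs like this one either order gives the same result.

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical attribute value as it appears in HTML source.
href = "https://example.com/p?userId=42&amp;token=a%2fb%2bc"

url = unescape(href)   # undo HTML escaping: &amp; -> &
print(url)             # https://example.com/p?userId=42&token=a%2fb%2bc

print(unquote(url))    # https://example.com/p?userId=42&token=a/b+c
```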

edited Sep 26, 2021 at 10:04

answered Aug 11, 2021 at 11:24

Roland Puntaier


3 Comments

pylover Over a year ago

html.unescape is unnecessary.

Roland Puntaier Over a year ago

Without unescape, the &amp; in the example is not converted to & on my computer. I just checked with Python 3.9.7.

bfontaine Over a year ago

The question is about decoding URLs, not HTML.

Ξένη Γήινος · Accepted Answer · 2022-09-20 05:51:59Z

I know this is an old question, but I stumbled upon it via a Google search and found that no one had proposed a solution that uses only built-in features.

So I quickly wrote my own.

Basically, a URL string can only contain these characters: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, and =. Everything else is URL-encoded.

URL encoding is pretty straightforward: each byte of a disallowed character's encoded form (UTF-8 here) is written as a percent sign followed by two hexadecimal digits.

So a simple while loop over the characters works. If the character is not a percent sign, add its byte as is and increment the index by one; otherwise, add the byte given by the two hex digits following the percent sign and increment the index by three. Accumulating the bytes and decoding them as UTF-8 gives the result.

Here is the code:

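def url_parse(url):
    """Percent-decode url and interpret the resulting bytes as UTF-8."""
    l = len(url)
    data = bytearray()
    i = 0
    while i < l:
        if url[i] != '%':
            # Literal character: keep its UTF-8 bytes.
            # (ord() alone fails for code points above U+00FF.)
            data.extend(url[i].encode('utf8'))
            i += 1
        else:
            # '%XY' escape: the two hex digits give one byte value.
            data.append(int(url[i+1:i+3], 16))
            i += 3

    return data.decode('utf8')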

I have tested it and it works perfectly.
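
One case the loop above does not cover is a malformed escape, such as a trailing % or % followed by non-hex characters, where int() raises ValueError. The following is a sketch of a more tolerant variant, not the original author's code. The name percent_decode and the sample string are made up for illustration. It leaves malformed escapes untouched, which is also what urllib.parse.unquote does.

```python
import string
from urllib.parse import unquote

HEX_DIGITS = set(string.hexdigits)

def percent_decode(text, encoding="utf-8"):
    """Percent-decode text, leaving malformed '%' sequences untouched."""
    data = bytearray()
    i = 0
    while i < len(text):
        pair = text[i + 1:i + 3]
        if text[i] == "%" and len(pair) == 2 and set(pair) <= HEX_DIGITS:
            data.append(int(pair, 16))             # valid escape: one byte
            i += 3
        else:
            data.extend(text[i].encode(encoding))  # literal character
            i += 1
    return data.decode(encoding, errors="replace")

sample = "example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE+100%"
print(percent_decode(sample))                      # example.com?title=право+100%
print(percent_decode(sample) == unquote(sample))   # True
```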
