How to normalize Python string encodings

Question

I have a text file with strings. These strings ultimately represent URL paths (not full URLs), but have been encoded in several ways. Here is an excerpt of the file:

25_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome%2C_Italy

I would like to guarantee a common format for all these strings, as after loading the file I will need to do string comparisons (e.g. Rome%2C_Italy should equal Rome,_Italy).

Some lines are URL encoded, and can be easily unquoted:

import urllib

with open("input.txt") as f:
    for line in f:
        # Undo %XX escapes (Python 2 urllib API)
        path = urllib.unquote(line.rstrip())
        print path

The output of the previous code is:

25_рашәара
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_рашәара
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome,_Italy

My best attempt is the following code:

import urllib

with open("input.txt") as f:
    for line in f:
        # Attempt: unquote, then call encode("utf8") on the result
        path = urllib.unquote(line.rstrip()).encode("utf8")
        print path

with the following output:

25_рашәара
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_рашәара
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome,_Italy

It seems to have ignored some lines!

In any case, I believe it would be preferable to simply URL-encode all these strings (as with line 1), but the urllib.quote() method will not work well on the lines that are already URL-encoded (it will encode % again!).

Any help clearing up my confusion is appreciated!

What you're attempting to do doesn't make sense to me, since you're destroying the URL. – FirebladeDan, commented Jul 28, 2015 at 15:26
As far as I understand, the strings with \xD1-style combinations are also UTF-8 encoded, and I would have expected the str.encode() method to convert them. – gdiazc, commented Jul 28, 2015 at 15:30

PM 2Ring · Accepted Answer · 2015-07-29 16:15:43Z

This code uses a similar approach to Eugene Lisitsky's answer, except that it runs on Python 2. There may be a neater way to do this in Python 2, but it appears to work correctly on the data in the OP.

BTW, you should tag your question with an appropriate Python version tag when you ask a question relating to Unicode, since Unicode handling in Python 3 is quite different to how it works (or fails to do so :) ) in Python 2.

import codecs
import urllib

fname = 'input.txt'

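with open(fname, 'rb') as f:
    for line in f:
        line = line.strip()
        # Undo %XX URL escapes; the result is still a byte string
        line = urllib.unquote(line)
        if r'\x' in line:
            # Interpret the literal \xNN escapes, then map each resulting
            # code point back to a single byte
            line = codecs.unicode_escape_decode(line)[0]
            line = line.encode('latin1')

        # The bytes are UTF-8 at this point; decode them to a Unicode object
        line = line.decode('utf-8')
        print repr(line), line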

output

u'25_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 25_рашәара
u'2_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 2_рашәара
u'5_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 5_рашәара
u'\u0410\u043a\u0430\u0431\u0430' Акаба
u'\u0410\u0448\u04d9\u0430\u0445\u044c\u0430' Ашәахьа
u'function.fopen' function.fopen
u'\u0411\u0440\u0430\u0437\u0438\u043b\u0438\u0430' Бразилиа
u'\u0412\u0430\u043b\u0435\u0440\u0438\u0438_\u041c\u0430\u0438\u0440\u043e\u043c\u0438\u0430\u043d' Валерии_Маиромиан
u'Rome,_Italy' Rome,_Italy
u'Rome,_Italy' Rome,_Italy

As you can see, I've converted all the strings to Unicode objects. If for some reason you want them as plain Python 2 strings just eliminate the line = line.decode('utf-8') line.
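
For readers on Python 3, here is a minimal sketch of the same steps. It is not part of the original answer. It assumes the file is opened in binary mode and that every line is ultimately UTF-8; the names normalize and to_url_form are hypothetical helpers introduced for illustration. The second helper percent-encodes the normalized text exactly once, which avoids the double-encoding of % that the question worries about.

```python
import codecs
from urllib.parse import quote, unquote_to_bytes


def normalize(raw):
    """Return one line of the input file (a bytes object) as a str.

    Hypothetical helper: assumes the underlying text is UTF-8 and that
    any backslash escapes present are of the \\xNN form.
    """
    # Undo %XX URL escapes while staying in bytes.
    data = unquote_to_bytes(raw.strip())
    if rb"\x" in data:
        # Interpret the literal \xNN escapes; every resulting character is
        # below U+0100, so latin-1 maps each one back to a single byte.
        data = codecs.unicode_escape_decode(data)[0].encode("latin-1")
    # The bytes are now plain UTF-8.
    return data.decode("utf-8")


def to_url_form(raw):
    """Return the percent-encoded form of the normalized text."""
    return quote(normalize(raw))
```

With these helpers, normalize(b"Rome%2C_Italy") and normalize(b"Rome,_Italy") both return 'Rome,_Italy', so the two lines compare equal. If you would rather store the URL-encoded form, to_url_form gives 'Rome%2C_Italy' for both inputs.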

Eugene Lisitsky · Accepted Answer · 2015-07-28 16:23:28Z

You may use codecs.unicode_escape_decode to decode backslash-escaped characters. In Python 3:

>>> import codecs
>>> s=r"\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0"
>>> print(s)
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
>>> s1=codecs.unicode_escape_decode(s)[0]
>>> print(s1)
ÐÐºÐ°Ð±Ð°
>>> bytes(s1,'latin1').decode('utf-8')
'Акаба'

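One caveat: unicode_escape interprets every backslash escape, not only \xNN, so a path containing a literal \n or \t would be altered. If that matters for your data, a narrower Python 3 sketch is to replace only the \xNN pairs with a regular expression. The helper name decode_hex_escapes is made up for this example, and it assumes the input is a bytes object.

```python
import re

# Matches a literal backslash, the letter x, and exactly two hex digits.
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")


def decode_hex_escapes(data):
    # Hypothetical helper: swap each literal backslash-x-NN sequence for
    # the single byte 0xNN and leave every other backslash untouched.
    return _HEX_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")),
        data,
    )
```

The result is still bytes, so finish with .decode('utf-8') as in the session above.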

This won't work for the OP, since he's using Python 2. – PM 2Ring
