
I have a text file with strings. These strings ultimately represent URL paths (not full URLs), but have been encoded in several ways. Here is an excerpt of the file:

25_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome%2C_Italy

I would like to guarantee a common format for all these strings, as after loading the file I will need to do string comparisons (e.g. Rome%2C_Italy should equal Rome,_Italy).

Some lines are URL encoded, and can be easily unquoted:

import urllib
with open("input.txt") as f:
    for line in f:
        str = urllib.unquote(line.rstrip())
        print str

The output of the previous code is:

25_рашәара
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_рашәара
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome,_Italy

My best attempt is the following code:

import urllib
with open("input.txt") as f:
    for line in f:
        str = urllib.unquote(line.rstrip()).encode("utf8")
        print str

with the following output:

25_рашәара
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_рашәара
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome,_Italy

It seems to have ignored some lines!

In any case, I believe it would be preferable to simply URL-encode all these strings (as with line 1 of the file), but the urllib.quote() function does not work well on lines that are already URL-encoded: it encodes the % characters a second time.
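One way around the double-encoding problem, sketched here for Python 3 (where these functions live in urllib.parse rather than urllib), is to percent-decode every line first and only then percent-encode it. The helper name requote is made up for illustration, and this sketch does not deal with the \xNN-style lines.

```python
from urllib.parse import quote, unquote_to_bytes


def requote(path):
    """Percent-decode, then percent-encode, so an existing '%' is never encoded twice.

    `requote` is a hypothetical helper name, not part of urllib.
    """
    # A str argument is encoded as UTF-8 and each %XX escape becomes one byte.
    raw = unquote_to_bytes(path)
    # safe="" also escapes ',' and '/'; pass safe="/" to keep slashes readable.
    return quote(raw, safe="")


print(requote("Rome,_Italy") == requote("Rome%2C_Italy"))  # True
```

Because decoding always happens before encoding, running requote on its own output should leave the string unchanged.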

Any help clearing up my confusion is appreciated!

  • What is the extension of the file you're reading? Commented Jul 28, 2015 at 15:24
  • It's a simple .txt file! Commented Jul 28, 2015 at 15:24
  • What you're attempting to do doesn't make sense to me, since you're destroying the URL. Commented Jul 28, 2015 at 15:26
  • As far as I understand, the strings with \xD1-style combinations are also UTF-8 encoded, and I would have expected the str.encode() method to convert them. Commented Jul 28, 2015 at 15:30
  • Plus, how could function.fopen ever look like Rome,_Italy? Commented Jul 28, 2015 at 15:32

2 Answers

1

This code uses a similar approach to Eugene Lisitsky's answer, except that it runs on Python 2. There may be a neater way to do this in Python 2, but it appears to work correctly on the data in the question.

BTW, you should tag your question with an appropriate Python version tag when you ask a question relating to Unicode, since Unicode handling in Python 3 is quite different to how it works (or fails to do so :) ) in Python 2.

import codecs
import urllib

fname = 'input.txt'

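with open(fname, 'rb') as f:
    for line in f:
        line = line.strip()
        # Resolve %XX escapes; the result is still a byte string
        line = urllib.unquote(line)
        if r'\x' in line:
            # Turn each literal \xNN escape into the code point NN (0-255)
            line = codecs.unicode_escape_decode(line)[0]
            # latin1 maps code points 0-255 back to bytes of the same value
            line = line.encode('latin1')

        # Every line is now UTF-8 bytes; decode it to a Unicode object
        line = line.decode('utf-8')
        print repr(line), line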

output

u'25_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 25_рашәара
u'2_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 2_рашәара
u'5_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 5_рашәара
u'\u0410\u043a\u0430\u0431\u0430' Акаба
u'\u0410\u0448\u04d9\u0430\u0445\u044c\u0430' Ашәахьа
u'function.fopen' function.fopen
u'\u0411\u0440\u0430\u0437\u0438\u043b\u0438\u0430' Бразилиа
u'\u0412\u0430\u043b\u0435\u0440\u0438\u0438_\u041c\u0430\u0438\u0440\u043e\u043c\u0438\u0430\u043d' Валерии_Маиромиан
u'Rome,_Italy' Rome,_Italy
u'Rome,_Italy' Rome,_Italy

As you can see, I've converted all the strings to Unicode objects. If for some reason you want them as plain Python 2 strings just eliminate the line = line.decode('utf-8') line.
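For readers on Python 3, here is a rough sketch of the same idea wrapped in a function so that the results can be compared directly. It is not a port of the code above: it swaps unicode_escape_decode for a regular expression that rewrites each literal \xNN as a single byte, and the names normalize and _HEX_ESCAPE are invented for this example. It assumes every line is UTF-8 once the escapes are resolved.

```python
import re
from urllib.parse import unquote_to_bytes

# Matches a literal backslash, 'x', and two hex digits (hypothetical helper name).
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")


def normalize(line):
    """Return text for a path that is percent-encoded, \\xNN-escaped, or plain UTF-8."""
    # Resolve %XX escapes; a str argument is encoded as UTF-8 first.
    raw = unquote_to_bytes(line.strip())
    # Replace each literal \xNN sequence with the single byte it names.
    raw = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    # Assumption: the resulting bytes are valid UTF-8.
    return raw.decode("utf-8")


print(normalize("Rome%2C_Italy") == normalize("Rome,_Italy"))  # True
print(normalize(r"\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0"))  # Акаба
```

Once every line has gone through normalize, ordinary == comparisons between the results behave as the question asks.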


0

You may use codecs.unicode_escape_decode to decode backslash-escaped characters like so:

>>> import codecs
>>> s=r"\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0"
>>> print(s)
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
>>> s1=codecs.unicode_escape_decode(s)[0]
>>> print(s1)
ÐÐºÐ°Ð±Ð°
>>> bytes(s1,'latin1').decode('utf-8')
'Акаба'
>>>
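The intermediate step can look odd, so here is a small Python 3 sketch of why the round trip through latin1 works. It uses the documented unicode_escape codec name rather than calling codecs.unicode_escape_decode directly, which should be equivalent for a pure-ASCII input like this one; the variable names are only illustrative.

```python
escaped = r"\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0"

# Step 1: each \xNN escape becomes ONE character whose code point is NN,
# so the result is a 10-character string of code points below 256.
as_code_points = escaped.encode("ascii").decode("unicode_escape")
print([hex(ord(c)) for c in as_code_points[:2]])  # ['0xd0', '0x90']

# Step 2: latin1 maps code points 0-255 to the byte with the same value,
# which recovers the original UTF-8 byte sequence.
raw = as_code_points.encode("latin-1")

# Step 3: those bytes can now be decoded as the UTF-8 text they really are.
decoded = raw.decode("utf-8")
print(decoded)  # Акаба
```

The mojibake printed after the first step is simply those ten code points displayed as Latin-1 characters.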

1 Comment

This won't work for the OP, since he's using Python 2.
