I have a text file with strings. These strings ultimately represent URL paths (not full URLs), but have been encoded in several ways. Here is an excerpt of the file:
25_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome%2C_Italy
I would like to guarantee a common format for all these strings, as after loading the file I will need to do string comparisons (e.g. Rome%2C_Italy should equal Rome,_Italy).
Some lines are URL encoded, and can be easily unquoted:
import urllib
with open("input.txt") as f:
    for line in f:
        str = urllib.unquote(line.rstrip())
        print str
The output of the previous code is:
25_рашәара
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_рашәара
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome,_Italy
My best attempt is the following code:
import urllib
with open("input.txt") as f:
    for line in f:
        str = urllib.unquote(line.rstrip()).encode("utf8")
        print str
with the following output:
25_рашәара
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_рашәара
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome,_Italy
It seems to have ignored some lines!
In any case, I believe it would be preferable to simply URL-encode all these strings (as with line 1), but the urllib.quote() method will not work well on the lines that are already URL-encoded (it will encode the % again!).
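To illustrate the double-encoding problem described above, here is a minimal sketch of one possible workaround: unquote first, then quote, so that already-encoded input is not encoded a second time. It assumes Python 3 (where these functions live in urllib.parse rather than urllib), and the helper name to_percent_encoded is hypothetical.

```python
from urllib.parse import quote, unquote


def to_percent_encoded(path):
    """Hypothetical helper: percent-encode a path without double-encoding.

    Unquoting first turns any existing %XX sequences back into characters,
    so the following quote() call encodes every string exactly once.
    """
    return quote(unquote(path))


# Quoting an already-encoded string directly encodes the '%' again.
double_encoded = quote("Rome%2C_Italy")

# Both spellings of the same path end up identical after the round trip.
same = to_percent_encoded("Rome,_Italy") == to_percent_encoded("Rome%2C_Italy")
```

One caveat with this approach: a string that contains a literal % followed by two hex digits cannot be told apart from an encoded one, so it would be decoded as well.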
Any help clearing up my confusion is appreciated!
The \xD1-style combinations are also UTF-8 encoded, and I would have expected the str.encode() method to convert them.
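For reference, the \xD1-style sequences in the file are literal text (a backslash, an x, and two hex digits), not bytes, which may be why encode() leaves them alone. Below is a rough sketch of one way to bring all three styles to a common Unicode form. It assumes Python 3 and that every escape represents a UTF-8 byte; the helper name normalize and the regular expression are hypothetical, not part of any library.

```python
import re
from urllib.parse import unquote_to_bytes

# Matches a literal backslash-x escape such as \xD1 (four characters of text).
HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def normalize(line):
    """Hypothetical helper: return one canonical Unicode string per line."""
    text = line.strip()
    # Rewrite literal \xNN escapes as %NN so a single decoder handles both styles.
    text = HEX_ESCAPE.sub(r"%\1", text)
    # Percent-decode to raw bytes, then interpret those bytes as UTF-8.
    # Plain (unencoded) text passes through unchanged.
    return unquote_to_bytes(text).decode("utf-8")
```

With this, normalize("Rome%2C_Italy") and normalize("Rome,_Italy") compare equal, and the \x lines and % lines that spell the same word map to the same string.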