
This should be an easy one, I hope. I have a URL:

http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol%C3%A9on.jpg

that is saved into a json file with this code:

import json

paintings = get_all_paintings(marc_chagall)
with open('chagall.json', 'w') as fb:
    json.dump(paintings, fb)  # json.dump returns None, so there is nothing to assign

In the file, the URL has become:

u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

I am able to get the original, usable, percent-encoded URL with this code:

import urllib

# Python 2: urllib.quote works on byte strings, so encode to UTF-8 first
p = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'
p = urllib.quote(p.encode('utf8'), safe='/:')
print repr(p)
> 'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol%C3%A9on.jpg'

Now comes the tricky part. I want to get this string:

http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napoléon.jpg

with the non-ascii character in napoléon intact. This is for naming purposes in the storage bucket, not for anything else. How can I produce this string?

2 Answers

4

Just print the unicode value:

>>> print u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'
http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napoléon.jpg

Don't confuse the Python representation of the Unicode value (which deliberately uses escapes for non-ASCII characters, for ease of debugging and introspection) with the actual value.
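
To make the distinction concrete, here is a small sketch (the variable names `url` and `word` are only illustrative) showing that the `\xe9` escape in the literal stands for a single character. It is written so that it should run unchanged on Python 2 and Python 3:

```python
url = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

# The escape \xe9 in the source literal denotes one code point, U+00E9,
# not the four characters backslash, x, e, 9.
word = u'napol\xe9on'
print(len(word))              # 8
print(word[5] == u'\u00e9')   # True

# The actual value contains no backslash at all.
print(u'\\' in url)           # False
```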

Printing encodes the value with the codec used by your console or terminal, provided Python was able to detect it. My terminal is set to UTF-8, so Python encoded the Unicode code point U+00E9 to the bytes C3 A9, and my terminal then interpreted those bytes as UTF-8 and displayed the é.
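
The encoding step can be reproduced by hand. This is only a sketch of what happens for a UTF-8 terminal; the names `e_acute`, `utf8_bytes` and `hex_form` are made up for the example, and a terminal using another codec would receive different bytes:

```python
import binascii

e_acute = u'\xe9'                       # the code point U+00E9

# Encoding to UTF-8 yields the two bytes C3 A9.
utf8_bytes = e_acute.encode('utf-8')
hex_form = binascii.hexlify(utf8_bytes).decode('ascii')
print(hex_form)                         # c3a9

# Decoding those bytes as UTF-8 gives the original character back.
print(utf8_bytes.decode('utf-8') == e_acute)   # True

# A Latin-1 terminal, by contrast, would be sent the single byte E9.
latin1_bytes = e_acute.encode('latin-1')
print(len(latin1_bytes))                # 1
```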

This all just means that you already have the right value, but were thrown by the debugging output.

3 Comments

I want to save the last part to a variable, like x.split('/')[-1]
@edmund_spenser: then just do so. Unicode strings support splitting just like byte strings do.
I was really thrown by, like you said, the python representation of the Unicode value. I didn't realize what I had.
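
As a sketch of the splitting mentioned in the comments above (the names `filename` and `filename_utf8` are illustrative, and whether a given storage API wants text or bytes is an assumption to check against its documentation):

```python
url = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

# Splitting a Unicode string works the same way as splitting a byte string.
filename = url.split(u'/')[-1]
print(filename == u'kopeikin-and-napol\xe9on.jpg')   # True

# If the destination needs bytes rather than text, encode explicitly.
filename_utf8 = filename.encode('utf-8')
print(len(filename), len(filename_utf8))
```

The encoded form is one byte longer than the text form because é takes two bytes in UTF-8.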

1

You already have it:

print u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

The value of p is already that string; it's only displayed differently.

2 Comments

That prints it to the console, but how do I save it to a variable and store it?
@edmund_spenser: the variable p already contains the string you want (exactly), it's only displayed differently (the sequence \xe9 is the character you want).
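
A quick way to convince yourself is to store the value and read it back. The sketch below uses `json.dumps`/`json.loads` on an in-memory string rather than a file, and the key `'url'` is made up for the example. With the default `ensure_ascii=True`, the JSON text contains the escape `\u00e9`, which is again only a representation of the same character:

```python
import json

p = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

# By default, json escapes non-ASCII characters in its output.
encoded = json.dumps({'url': p})
print('\\u00e9' in encoded)        # True

# Loading the JSON text restores exactly the same Unicode string.
restored = json.loads(encoded)['url']
print(restored == p)               # True
```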
