Python String Encoding that uses &#x**;

Question

I have one (non-python) script that encodes a string a certain way, and I need a python script to use the same method of encoding.

The original string is Hügelkultur and the original script converts that to H&#xFC;gelkultur.

Right now my python script isn't doing any type of encoding and the output file shows HÃ¼gelkultur.

If I run this script:

string = "Hügelkultur"
string = string.encode()
print(string)

It outputs b'H\xc3\xbcgelkultur'.

How should I encode the string so that the output matches the original script's?

Those are (hex) HTML entities. You're looking for an HTML encoder.
– deceze ♦, Commented Aug 27, 2019 at 14:53
Do you happen to know which encoding your original script uses?
– adiaz004, Commented Aug 27, 2019 at 14:57
It might be tricky to get exactly that, since there are multiple entities that could refer to the same Unicode character. ü, for example, could be &#xFC; or &#252;. Whether you need an entity at all depends on the encoding of your file; as long as you can produce a UTF-8 encoded file, you can simply write ü as the bytes \xc3\xbc directly instead of an entity.
– chepner, Commented Aug 27, 2019 at 15:03
@adiaz004, I don't. There's a script that runs and generates some static files, and I need my output to match its output.
– GFL, Commented Aug 27, 2019 at 15:03
The two outputs are matched against each other to create nodes and edges in a webgraph. If they're not the same, they are considered different nodes.
– GFL, Commented Aug 27, 2019 at 15:16
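
A note on the HÃ¼gelkultur output mentioned in the question: that pattern is what you typically see when UTF-8 bytes are displayed as Latin-1 (or a similar single-byte encoding). The sketch below reproduces it; it assumes the output file was written as UTF-8 and then viewed with the wrong encoding, which the question does not confirm.

```python
original = "Hügelkultur"

# ü (U+00FC) becomes the two bytes C3 BC in UTF-8
utf8_bytes = original.encode("utf-8")

# Reading those two bytes as Latin-1 yields two separate characters
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes)
print(misread)
```

If that is what is happening, the file itself may be fine and only the viewer's encoding is wrong, but it still won't match the entity form the other script produces.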

wim · Accepted Answer · 2019-08-27 15:42:34Z

I'm not aware of a stdlib encoder that produces exactly that output, but encoding to ASCII with the xmlcharrefreplace error handler will get you most of the way there:

>>> "Hügelkultur".encode("ascii", errors="xmlcharrefreplace")
b'H&#252;gelkultur'

The &#xFC; form is equivalent to &#252; because 0xFC == 252, and HTML decoders should be happy with either form. However, if you do need an exact style match, you can write a simple function to do this manually:

>>> def convert(s):
...     chars = []
...     for char in s:
...         if char.isascii():  # str.isascii() requires Python 3.7+
...             chars.append(char)
...         else:
...             # Non-ASCII: emit an uppercase hex character reference
...             chars.append(f"&#x{ord(char):X};")
...     return "".join(chars)
...
>>> convert("Hügelkultur")
'H&#xFC;gelkultur'
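
If you would rather keep the .encode() call style, another option is to register a custom error handler with codecs.register_error. This is only a sketch: the handler name "hexcharrefreplace" and the function hex_charref are made up for illustration and are not part of the stdlib.

```python
import codecs


def hex_charref(exc):
    """Replace unencodable characters with uppercase hex character references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object[exc.start:exc.end] is the run of characters that failed to encode
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(f"&#x{ord(ch):X};" for ch in bad)
    # Return the replacement text and the position at which encoding should resume
    return replacement, exc.end


# "hexcharrefreplace" is a hypothetical name chosen for this example
codecs.register_error("hexcharrefreplace", hex_charref)

encoded = "Hügelkultur".encode("ascii", errors="hexcharrefreplace")
print(encoded)  # b'H&#xFC;gelkultur'
```

Note that this gives you bytes rather than str, so call .decode("ascii") on the result if you need text.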

For posterity's sake, going back the other way is an html unescape:

>>> import html
>>> html.unescape('H&#xFC;gelkultur')
'Hügelkultur'
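
Since the comments mention that the two outputs are compared to build graph nodes, one more thought: if you control the comparison step, you could unescape both sides first instead of matching the entity style exactly. A small sketch, assuming the three spellings below are the ones you might run into:

```python
import html

# Hex, decimal, and named references for the same character
variants = ["H&#xFC;gelkultur", "H&#252;gelkultur", "H&uuml;gelkultur"]

# html.unescape maps all three to the same string
decoded = {html.unescape(v) for v in variants}
print(decoded)
```

All three collapse to a single value, so they would be treated as the same node after normalising.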

edited Aug 27, 2019 at 15:42

answered Aug 27, 2019 at 15:05
