I have one (non-Python) script that encodes a string a certain way, and I need a Python script to use the same encoding method.

The original string is Hügelkultur and the original script converts that to H&#xFC;gelkultur.

Right now my python script isn't doing any type of encoding and the output file shows Hügelkultur.

If I run this script:

string = "Hügelkultur"
string = string.encode()
print(string)

It outputs b'H\xc3\xbcgelkultur'.

How should I encode the string so that the result matches the original script's output?

  • Those are (hex) HTML entities. You're looking for an HTML encoder. Commented Aug 27, 2019 at 14:53
  • do you happen to know which encoding you're using in your original script? Commented Aug 27, 2019 at 14:57
  • It might be tricky to get exactly that, since there are multiple entities that could refer to the same Unicode character. ü, for example, could be &#xFC; or &#252;. Whether you need an entity depends on the encoding of your file; as long as you can produce a UTF-8 encoded file, you can simply specify ü as the bytes \xc3\xbc directly instead of an entity. Commented Aug 27, 2019 at 15:03
  • @adiaz004, I don't. There's a script that runs and generates some static files. I need to make my output match with that one. Commented Aug 27, 2019 at 15:03
  • The two outputs are matched against each other to create nodes and edges in a webgraph. If they're not the same, they are considered different nodes. Commented Aug 27, 2019 at 15:16

1 Answer

I'm not aware of a stdlib encoder that produces exactly that output, but an ASCII encoding with the xmlcharrefreplace error handler will get you most of the way there:

>>> "Hügelkultur".encode("ascii", errors="xmlcharrefreplace")
b'H&#252;gelkultur'

&#xFC; is equivalent to &#252; because 0xfc == 252, and HTML decoders should be happy with either form. However, if you do need an exact style match, perhaps just write a simple function to do this manually:

>>> def convert(s):
...     chars = []
...     for char in s:
...         if char.isascii():  # str.isascii() needs Python 3.7+
...             chars.append(char)
...         else:
...             # Non-ASCII: emit an uppercase hex character reference.
...             chars.append(f"&#x{ord(char):X};")
...     return "".join(chars)
...
>>> convert("Hügelkultur")
'H&#xFC;gelkultur'
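
As an alternative to looping by hand, the same hex style can probably be obtained by registering a custom error handler with codecs.register_error and reusing str.encode. This is only a sketch: the handler name "hexcharrefreplace" and the helper names hex_charref_replace and to_hex_entities are made up for illustration and are not part of the standard library.

```python
import codecs


def hex_charref_replace(error):
    """Hypothetical error handler: replace unencodable characters
    with uppercase hexadecimal character references (&#xFC;)."""
    if not isinstance(error, UnicodeEncodeError):
        # This handler only makes sense while encoding.
        raise error
    # error.object is the full input string; start/end delimit the
    # run of characters that the codec could not encode.
    bad = error.object[error.start:error.end]
    replacement = "".join(f"&#x{ord(ch):X};" for ch in bad)
    # Return the replacement text and the position to resume from.
    return replacement, error.end


# The name "hexcharrefreplace" is an arbitrary choice for this sketch.
codecs.register_error("hexcharrefreplace", hex_charref_replace)


def to_hex_entities(text):
    """Encode to ASCII with the custom handler, then return a str."""
    return text.encode("ascii", errors="hexcharrefreplace").decode("ascii")


print(to_hex_entities("Hügelkultur"))
```

Once registered, the handler name can be passed as the errors argument anywhere an encode call accepts one, which may be convenient if the encoding happens in several places.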

For posterity's sake, going back the other way is an html unescape:

>>> import html
>>> html.unescape('H&#xFC;gelkultur')
'Hügelkultur'
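
Since decoders treat the hex, decimal, and named forms alike, one option for the node-matching problem mentioned in the comments might be to compare the decoded text rather than the encoded strings. The sketch below assumes you control the comparison step, which may not be true for the asker's pipeline; same_label is a hypothetical helper name.

```python
import html


def same_label(a, b):
    """Hypothetical helper: treat two labels as equal if they decode
    to the same text, regardless of which entity style each uses."""
    return html.unescape(a) == html.unescape(b)


print(same_label("H&#xFC;gelkultur", "H&#252;gelkultur"))
print(same_label("H&uuml;gelkultur", "Hügelkultur"))
```

If the comparison cannot be changed, producing the exact hex style on the Python side remains the safer route.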