Python: Encoding issues?

Question

In my HTML file, the word "Schilderung" looks normal and doesn't seem to have an (encoding?) problem. But when I copy the word, I get "Schilde rung", and when I check its length with Python, I get 13 (instead of 11).

What's the problem here, and how can I handle this?

Thanks a lot for any help!

EDIT: At the moment I use output.write(text.decode("utf-8")). This handles all umlauts and other special characters correctly, but the problem above is still present. print(repr(txt)) gives 'Schilde\xc2\xadrung'. How can I solve this? Thanks a lot!

Is there an umlaut or some other special char in the string?
– immortal, Commented Sep 6, 2013 at 9:48
Yes, there are umlauts and other special characters in the string. So I need to fix the "Schilde rung" problem (which works with the printable or encode solutions), BUT I also need to keep the umlauts and other special characters, which are represented correctly...
– MarkF6, Commented Sep 6, 2013 at 11:06

jfs · Accepted Answer · 2013-09-06 10:01:29Z

1

There is a U+00AD SOFT HYPHEN (an invisible character) before the "r" in the string:

>>> "Schilde\xc2\xadrung".decode('utf-8')
u'Schilde\xadrung'
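
To check any string for hidden characters like this one, a minimal sketch along the following lines may help. It assumes the input is UTF-8 bytes as shown by repr() in the question; the helper name describe_non_ascii is made up for illustration. It also shows where the 13 comes from: the soft hyphen takes two bytes in UTF-8, so the byte string is two longer than the 11 visible letters.

```python
import unicodedata

# UTF-8 bytes exactly as repr() printed them in the question.
raw = b"Schilde\xc2\xadrung"
text = raw.decode("utf-8")

print(len(raw))   # 13: 11 letters + 2 bytes for the soft hyphen
print(len(text))  # 12: 11 letters + 1 code point (U+00AD)

def describe_non_ascii(s):
    """Hypothetical helper: list (index, code point, Unicode name)
    for every non-ASCII character in a decoded string."""
    return [(i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for i, ch in enumerate(s) if ord(ch) > 127]

print(describe_non_ascii(text))  # [(7, 'U+00AD', 'SOFT HYPHEN')]
```

Note that this reports umlauts as well, since they are also non-ASCII; the Unicode name tells you which hits are harmless.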

To remove all non-ASCII characters (note that this also drops umlauts):

>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
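
If only the soft hyphen should go and umlauts should stay, one option is to remove U+00AD explicitly instead of encoding to ASCII. This is a sketch under the assumption that the text has already been decoded to Unicode; strip_soft_hyphens and the sample string are invented for illustration.

```python
SOFT_HYPHEN = u"\u00ad"

def strip_soft_hyphens(text):
    """Hypothetical helper: remove U+00AD SOFT HYPHEN and nothing else."""
    return text.replace(SOFT_HYPHEN, u"")

# Made-up sample containing a soft hyphen and two umlauts.
sample = u"Schilde\u00adrung f\u00fcr M\u00e4rkte"
cleaned = strip_soft_hyphens(sample)

print(len(strip_soft_hyphens(u"Schilde\u00adrung")))  # 11
print(u"\u00fc" in cleaned and u"\u00e4" in cleaned)  # True: umlauts survive
```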

answered Sep 6, 2013 at 10:01

jfs


5 Comments

MarkF6 Over a year ago

Yeah, that's exactly the problem. Is there a way to check each word for this? When I apply this to all words, I get the error: "'ascii' codec can't decode byte 0xc3 in position 1162: ordinal not in range(128)"

jfs Over a year ago

@MarkF6: the error means that you are trying to encode bytes (which you should not do) instead of a Unicode string. If your input is a byte string that contains UTF-8 text, you could call .decode to get a Unicode string that has only ASCII characters: b"Schilde\xc2\xadrung".decode('ascii', 'ignore') -> u'Schilderung'

MarkF6 Over a year ago

OK, I did this, but what about the umlauts? If I do this, I lose all of them (ä ö ü).

jfs Over a year ago

@MarkF6: If you don't mind non-ascii characters then just convert your input bytes to Unicode e.g. using .decode('utf-8') as shown in the very first line in the answer and stop at that. If it is not enough then update your question to specify filtering rules i.e., what categories of characters to remove (blacklist), what to preserve (whitelist), etc. There are no universal rules; you need to choose what is appropriate in your case.

MarkF6 Over a year ago

I added some specifications as comment and as EDIT to the initial post. Thanks a lot for the help!
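
As one possible example of the "blacklist" style of filtering rule mentioned in the comments above, the sketch below drops every character in Unicode general category Cf (format characters, which includes U+00AD) and keeps everything else, umlauts included. The function name remove_format_chars is hypothetical, and whether Cf is the right category to drop is an assumption: it also covers characters such as zero-width joiners and bidirectional marks, which matter in some scripts.

```python
import unicodedata

def remove_format_chars(text):
    """Hypothetical filter: drop characters whose Unicode general
    category is Cf (format), e.g. U+00AD SOFT HYPHEN or U+200B
    ZERO WIDTH SPACE. Letters such as umlauts are kept."""
    return u"".join(ch for ch in text
                    if unicodedata.category(ch) != "Cf")

# Made-up sample: soft hyphen inside the word, umlauts, trailing U+200B.
sample = u"Schilde\u00adrung \u00e4\u00f6\u00fc\u200b"
print(remove_format_chars(sample) == u"Schilderung \u00e4\u00f6\u00fc")  # True
```

Apply it to decoded text (the result of .decode('utf-8')), not to the raw bytes.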

user2725093 · Accepted Answer · 2013-09-06 09:50:41Z

0

Seems like there is an invisible non-ASCII character before the "r" (the pasted literal below contains it):

>>> u'Schilderung'
u'Schilde\xadrung'

answered Sep 6, 2013 at 9:50

user2725093

