3

In my HTML file, the word "Schilde­rung" looks normal and doesn't seem to have an (encoding?) problem. But when I copy the word, I get "Schilde rung", and when I check its length in Python, I get 13 (instead of 11).

What's the problem here, and how can I handle this?

Thanks a lot for any help!

EDIT: At the moment, I use output.write(text.decode("utf-8")). This handles all umlauts and other special characters correctly, but the problem above is still present. print(repr(txt)) gives: Schilde\xc2\xadrung. How can we solve this problem? Thanks a lot!

  • show us print(repr(the_word)) Commented Sep 6, 2013 at 9:47
  • Is there an umlaut or some other special char in the string? Commented Sep 6, 2013 at 9:48
  • Yes, there are umlauts and other special characters in the string. So I need to fix the problem with "Schilde rung" (which works with the printable or encode solutions), BUT I also need to keep the umlauts and other special characters that are represented correctly... Commented Sep 6, 2013 at 11:06

2 Answers

1

There is a U+00AD SOFT HYPHEN before the "r" in the string:

>>> "Schilde­rung".decode('utf-8')
u'Schilde\xadrung'

To remove all non-ASCII characters (this also drops umlauts):

>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
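
If only the soft hyphen should go and umlauts should stay, one option is to decode the bytes and then remove just that one code point. The following is a minimal sketch, not part of the original answer: the helper name `strip_soft_hyphens` and the sample sentence are made up for illustration, and the code is written so that it should run on Python 3 (the `u""` and `b""` literals are also valid in Python 2.7).

```python
# -*- coding: utf-8 -*-
SOFT_HYPHEN = u"\u00ad"  # U+00AD SOFT HYPHEN, invisible in most renderers


def strip_soft_hyphens(text):
    """Return text with every U+00AD removed; all other characters are kept."""
    return text.replace(SOFT_HYPHEN, u"")


# Hypothetical sample: UTF-8 bytes as they might be read from the HTML file.
raw = b"Schilde\xc2\xadrung f\xc3\xbcr M\xc3\xa4dchen"
text = raw.decode("utf-8")        # bytes -> Unicode text
clean = strip_soft_hyphens(text)  # umlauts survive, soft hyphen does not

first_word = clean.split()[0]
print(len(first_word))
```

Because the replacement targets a single code point, characters such as ä, ö and ü pass through untouched, unlike with `.encode('ascii', 'ignore')`.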

5 Comments

Yeah, that's exactly the problem. Is there any way that I can check each word for this property? When I apply this to all words, I get the error: "'ascii' codec can't decode byte 0xc3 in position 1162: ordinal not in range(128)"
@MarkF6: the error means that you are calling .encode() on a byte string (which you should not do) instead of on a Unicode string. If your input is a byte string that contains UTF-8 text, you can call .decode to get a Unicode string that has only ASCII characters: b"Schilde­rung".decode('ascii', 'ignore') -> u'Schilderung'
OK, I did this, but what about the umlauts? If I do this, I lose all of them (ä, ö, ü).
@MarkF6: If you don't mind non-ascii characters then just convert your input bytes to Unicode e.g. using .decode('utf-8') as shown in the very first line in the answer and stop at that. If it is not enough then update your question to specify filtering rules i.e., what categories of characters to remove (blacklist), what to preserve (whitelist), etc. There are no universal rules; you need to choose what is appropriate in your case.
I added some specifications as comment and as EDIT to the initial post. Thanks a lot for the help!
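
The "blacklist by category" idea from the comments above could be sketched as follows. This is an illustration under assumptions, not something the commenters posted: it blacklists the Unicode general category `Cf` (format characters, which includes U+00AD), the helper names are invented, and it targets Python 3.

```python
import unicodedata


def has_format_chars(word):
    """True if the word contains any character of Unicode category Cf."""
    return any(unicodedata.category(ch) == "Cf" for ch in word)


def strip_format_chars(text):
    """Drop all Cf characters (soft hyphen, zero-width space, ...); keep the rest."""
    return u"".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Hypothetical word list; the first entry hides a soft hyphen.
words = [u"Schilde\u00adrung", u"M\u00e4dchen", u"Stra\u00dfe"]
flagged = [w for w in words if has_format_chars(w)]
cleaned = [strip_format_chars(w) for w in words]
print(flagged, cleaned)
```

Letters with diacritics are in category `Ll` or `Lu`, so they are preserved. Note that `Cf` also covers zero-width joiners, which carry meaning in some scripts, so this blacklist may be too broad for text that is not German.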
0

Seems like there is a non-ASCII character before the "r":

>>> u'Schilde­rung'
u'Schilde\xadrung'
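
To see which character is hiding in a string, it can help to list every non-ASCII character with its position, code point and Unicode name. A small sketch for Python 3 follows; the function name `describe_non_ascii` is made up for this example.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


report = describe_non_ascii(u"Schilde\u00adrung")
print(report)
```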
