I have the following string -

"\xed\xad\x80\xed\xb1\x93"

When I use this string in a query against a PostgreSQL database, it raises the following error -

DataError: invalid byte sequence for encoding "UTF8": 0xed 0xad 0x80

When I decode it in Python 2.7 (before executing the query), no exception is raised -

Windows test -

>>> '\xed\xad\x80\xed\xb1\x93'.decode("utf-8")
u'\U000e0053'

Linux test -

>>> '\xed\xad\x80\xed\xb1\x93'.decode("utf-8")
u'\udb40\udc53'

In Python 3, decoding the same bytes does raise an exception -

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

How can I check in Python 2.7 that it's not a valid UTF-8 string?

1 Answer

The byte sequence follows the UTF-8 bit pattern, but it does not encode a character.

The bytes 0xED 0xAD 0x80 decode to the Unicode code point U+DB40, which is a “high surrogate” and not a character as such.

So this data does not consist of UTF-8 encoded characters. It makes no sense to encode surrogates in UTF-8; they exist for UTF-16, where a pair of surrogates represents a single code point above U+FFFF.
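
The arithmetic can be checked by hand. The following is a minimal sketch, not library code: `decode_three_byte` is a hypothetical helper that only extracts the payload bits of a three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx) and does no validation. It also shows why the Windows build in the question prints u'\U000e0053': the two surrogates, read as a UTF-16 pair, combine into that code point.

```python
def decode_three_byte(b0, b1, b2):
    """Combine the payload bits of a 3-byte UTF-8 sequence (no validation)."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# The two three-byte groups from the question's string.
high = decode_three_byte(0xED, 0xAD, 0x80)
low = decode_three_byte(0xED, 0xB1, 0x93)

# Reading the two values as a UTF-16 surrogate pair yields one code point.
combined = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(high), hex(low), hex(combined))  # 0xdb40 0xdc53 0xe0053
```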

RFC 3629 actually declares that encoding surrogates is not allowed:

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

So that sounds like a bug in Python 2, and you can report it as such.
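
As for checking this in Python 2.7 before the query runs, one possible workaround is to look at the raw bytes. In UTF-8, 0xED is always a lead byte, and a surrogate (U+D800 to U+DFFF) is always encoded as 0xED followed by a byte in the range 0xA0 to 0xBF. The sketch below relies on that; `is_strict_utf8` is a name made up for this example, and the function only adds the surrogate check on top of whatever `decode` already rejects on your interpreter.

```python
import re

# Lead byte 0xED followed by 0xA0..0xBF encodes a code point in U+D800..U+DFFF.
_SURROGATE_BYTES = re.compile(b"\xed[\xa0-\xbf]")

def is_strict_utf8(data):
    """Return True if the byte string decodes as UTF-8 and encodes no surrogates."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return _SURROGATE_BYTES.search(data) is None

print(is_strict_utf8(b"\xed\xad\x80\xed\xb1\x93"))  # False
print(is_strict_utf8(b"\xf3\xa0\x81\x93"))          # True (U+E0053 encoded properly)
```

Checking the bytes rather than the decoded unicode object avoids a false positive on narrow builds, where a legitimate four-byte character is itself stored as a surrogate pair after decoding.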

5 Comments

Yeah, but how can I check in Python 2.7 whether it's really not UTF-8?
No idea. But you tagged PostgreSQL, so I thought it might be useful to explain what's going on.
@LaurenzAlbe: Since RFC 3629, encoding individual surrogate halves is no longer valid in UTF-8, so this sequence is actually invalid.
@JoachimSauer Thanks for the information, that makes this a Python bug.
Note: Python 3 has the "surrogateescape" error handler, which uses surrogate code points to represent non-Unicode data, so something similar may also be going on in Python 2.7. This is a special case and is seldom used: it applies where there is no other way, e.g. for "strings" that may arrive as raw bytes, such as sys.argv and environment variables. (True decoding might lose some important information, but it is useful to be able to handle the data as text in 99.99% of cases.)
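For reference, a minimal Python 3 sketch of what that handler does. As far as I know it was introduced by PEP 383 in Python 3.1 and is not available in Python 2.7's codecs, so treat this as an illustration of the mechanism rather than an explanation of the 2.7 behaviour. It maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF, which are low surrogates, whereas the first code point in the question (U+DB40) is a high surrogate.

```python
# Python 3 only: decode bytes that are not valid UTF-8 without losing them.
raw = b"abc\xff"
text = raw.decode("utf-8", "surrogateescape")         # 0xFF becomes U+DCFF
round_trip = text.encode("utf-8", "surrogateescape")  # U+DCFF becomes 0xFF again

print(ascii(text))        # 'abc\udcff'
print(round_trip == raw)  # True
```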
