2

I've received a unicode string from the wild that causes some of our psycopg2 statements to fail.

I have reduced the problem to an SSCCE:

import psycopg2
conn = psycopg2.connect(...)
cur = conn.cursor()
x = u'\ud837'
cur.execute("SELECT %s", (x,))
print cur.fetchone()

Running this gives the following exception:

Traceback (most recent call last):
  File ".../run.py", line 65, in <module>
    cur.execute("SELECT %s AS test", (x,))
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xb7

Based on some of the comments, it has become clear that this particular character is one half of a surrogate pair, making it invalid to live on its own.

Specifically then, I am looking for a mechanism to detect when a string contains an incomplete surrogate pair in Python 2.

One method I have found that raises an exception for such a string is x.encode('utf16').decode('utf16'). However, I don't fully understand the risks associated with it, so I am hesitant to rely on it.
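For reference, the round-trip check could be wrapped in a small helper like the sketch below. The name has_lone_surrogate is made up for illustration and is not part of psycopg2. It relies only on the fact that the encode or decode step raises a UnicodeError subclass for an unpaired surrogate.

```python
def has_lone_surrogate(s):
    """Return True if the UTF-16 round trip of `s` fails.

    Hypothetical helper. On Python 2 the decode step raises
    UnicodeDecodeError; on Python 3 the encode step already raises
    UnicodeEncodeError. Both derive from UnicodeError.
    """
    try:
        s.encode('utf-16').decode('utf-16')
    except UnicodeError:
        return True
    return False


print(has_lone_surrogate(u'\ud837'))      # the problem character
print(has_lone_surrogate(u'plain text'))  # ordinary text
```

One caveat: on Python 3, encoding fails for any surrogate code point in a str, paired or not, so this check is stricter there than on a narrow Python 2 build.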

Edit: Reduced the SSCCE string to the single character causing the problem and added information based on comments.

3
  • The character represents one half of a surrogate pair and doesn't represent a code point of its own. Presumably you obtained it through an API that split a UTF-16-encoded string without paying attention to character boundaries. Commented Nov 14, 2016 at 19:49
  • @user4815162342 so how can I detect whether a given string in python contains any such incomplete surrogate pairs? Commented Nov 14, 2016 at 19:51
  • Just curious, has my answer helped with the question? Commented Nov 30, 2016 at 17:58

2 Answers

2

The string u'\ud837' consists of a lone member of a surrogate pair. A surrogate pair is two 16-bit code units that appear in sequence to encode a single logical character. On its own, a surrogate does not represent a character: it is an implementation detail of the UTF-16 encoding, which uses surrogates to pack the full code point range into 16-bit code units. Python 3 correctly rejects attempts to encode lone surrogates in any byte encoding under the default strict error handler, including the UTF-* variants.

The string probably originated from a system that internally uses UTF-16 (such as Java, C#, Windows, or Python 2 built with 16-bit Py_UNICODE) that naively shortened the string without taking care of surrogates.

Taking the regex from this answer, it should be possible to efficiently detect such strings using code such as:

import re

lone = re.compile(
    ur'''(?x)            # verbose expression (allows comments)
    (                    # begin group
    [\ud800-\udbff]      #   match leading surrogate
    (?![\udc00-\udfff])  #   but only if not followed by trailing surrogate
    )                    # end group
    |                    #  OR
    (                    # begin group
    (?<![\ud800-\udbff]) #   if not preceded by leading surrogate
    [\udc00-\udfff]      #   match trailing surrogate
    )                    # end group
    ''')

def invalid_unicode(s):
    assert isinstance(s, unicode)
    return lone.search(s) is not None
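
If it also matters where the bad characters are, for example to log them before rejecting the input, the same pattern can report offsets. The following is a minimal sketch under that assumption. LONE_SURROGATE and lone_surrogate_positions are illustrative names, and the pattern is written without the Python 2-only ur'' prefix so that it also compiles on Python 3.

```python
import re

# Same two alternatives as the verbose pattern above, as a plain
# unicode literal (no backslash sequences here need a raw string).
LONE_SURROGATE = re.compile(
    u'[\ud800-\udbff](?![\udc00-\udfff])'    # lead not followed by trail
    u'|(?<![\ud800-\udbff])[\udc00-\udfff]'  # trail not preceded by lead
)


def lone_surrogate_positions(s):
    """Return the index of every unpaired surrogate in `s` (hypothetical helper)."""
    return [m.start() for m in LONE_SURROGATE.finditer(s)]


print(lone_surrogate_positions(u'ab\ud837cd'))  # one lone lead surrogate at index 2
print(lone_surrogate_positions(u'plain text'))  # no surrogates at all
```

An empty list means the string passed the check, so a caller could refuse to hand the value to cur.execute whenever the list is non-empty.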

2

To detect that the string is not valid UTF-8, wrap an attempt to encode it in a try/except before passing it to psycopg2.

As for what caused the problem, there is a specific character in the middle of the string that is UTF-16 encoded: \U000d8a85. So it's not that Postgres wrongly rejects it as UTF-8; it really isn't valid UTF-8.

2 Comments

Thanks for the explanation, but x.encode('utf-8') does not raise an exception, and neither does x.encode('utf-8').decode('utf-8'). That leads me to believe that either Python considers this valid UTF-8, or Python falls back to decoding UTF-8 in a non-strict way.
Also, after further tinkering, it appears the specific character causing the problem is \ud837. Any idea what's going on there?
