
I have a table called Message which has a column content of type JSON.

My model definition, as requested, is as follows:

class Message(db.Model):
    ...
    content = db.Column(JSON)
    ...

Now I perform text matching for a simple search with the following query:

Message.query.filter(Message.content['summary'].cast(Unicode).match(term))

It works nicely until term contains a UTF-8 character, such as German umlauts or French accents.

What would be the solution here?

Also keep in mind that I'm using Python 3.

4 Comments
  • Do you need the cast method? Decoding has to be performed before JSON is parsed, so when you access content['summary'], this should already have happened. Commented Apr 19, 2017 at 10:24
  • @lenz Probably yes; otherwise, if I remove it, I get this error: sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) operator does not exist: json @@ tsquery LINE 3: WHERE bots_messages.content -> 'summary' @@ to_tsquery(bla bla bla Commented Apr 19, 2017 at 10:29
  • You should include the definition of Message in the question, as it is a crucial part of the issue at hand (json vs. jsonb). Commented Apr 19, 2017 at 11:30
  • @IljaEverilä Done ;) Commented Apr 24, 2017 at 8:23

1 Answer


The problem lies in cast(Unicode) of a Postgresql json column. It simply CASTs the json to the text type underlying SQLAlchemy's Unicode, which in the case of Postgresql is VARCHAR. In other words, it produces a string representation of the JSON instead of extracting the text content. If your input contained escaped unicode code points, they are output as-is in this case. Given a simple Test model with a json column data:
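
The Test model isn't shown in the original answer, so the following is only a minimal sketch under assumed names: a plain declarative mapping with a Postgresql JSON column and a throwaway session setup for the interactive examples below.

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.dialects.postgresql import JSON
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Test(Base):
    __tablename__ = 'test'
    id = Column(Integer, primary_key=True)
    data = Column(JSON)  # plain json column, as discussed above

# Hypothetical connection; point this at any Postgresql database you can test against.
engine = create_engine('postgresql://localhost/test')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()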

In [7]: t = Test(data={'summary': 'Tämä on summary.'})

In [8]: session.add(t)

In [9]: session.commit()

In [11]: session.query(Test.data['summary'].cast(Unicode)).scalar()
Out[11]: '"T\\u00e4m\\u00e4 on summary."'

It should be evident why a match with unescaped unicode characters will fail. The correct way to extract the text content, unescaping escaped unicode, is to use astext, which uses the ->> operator in Postgresql:

In [13]: session.query(Test.data['summary'].astext).scalar()
Out[13]: 'Tämä on summary.'

Quoting the JSON functions and operators documentation:

Note: Many of these functions and operators will convert Unicode escapes in JSON strings to the appropriate single character. This is a non-issue if the input is type jsonb, because the conversion was already done; but for json input, this may result in throwing an error, as noted in Section 8.14.

So in your case:

Message.query.\
    filter(Message.content['summary'].astext.match(term))
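
If you want to verify what SQL this produces, you can compile the statement against the Postgresql dialect. This is just a sanity-check sketch; the exact parameter names and grouping in the output will vary.

from sqlalchemy.dialects import postgresql

query = Message.query.filter(Message.content['summary'].astext.match(term))
print(query.statement.compile(dialect=postgresql.dialect()))
# The WHERE clause should now use the ->> operator, roughly:
#   (bots_messages.content ->> %(content_1)s) @@ to_tsquery(%(param_1)s)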

Note that this only applies to the json type, not jsonb, because the json type does not convert unicode escapes on input. jsonb, on the other hand, converts all unicode escapes to equivalent ASCII or UTF-8 characters for storage. If our Test model contained a second column data2 of type jsonb, with the exact same input, the result would be:

In [11]: session.query(Test.data['summary'].cast(Unicode),
    ...:               Test.data2['summary'].cast(Unicode)).first()
Out[11]: ('"T\\u00e4m\\u00e4 on summary."', '"Tämä on summary"')

Still, you should use astext if you want text instead of a string representation of the JSON.


2 Comments

Thanks a lot for the very informative and helpful answer :)
You're welcome. The original explanation was lacking and a bit wrong, so I updated it to mention the differences between json and jsonb and unicode escapes.
