
I have a table called Message which has a column content of type JSON.

My model definition, as requested, is as follows:

class Message(db.Model):
    ...
    content = db.Column(JSON)
    ...

Now I perform text matching for a simple search with the following query:

Message.query.filter(Message.content['summary'].cast(Unicode).match(term))

It works nicely until term contains a UTF-8 character, such as German umlauts or French accents.

What would be the solution here?

Also keep in mind that I'm using Python 3.

4 Comments
  • Do you need the cast method? Decoding has to be performed before JSON is parsed, so when you access content['summary'], this should already have happened. Commented Apr 19, 2017 at 10:24
  • @lenz Probably yes; otherwise, if I remove it, I get this error: sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) operator does not exist: json @@ tsquery LINE 3: WHERE bots_messages.content -> 'summary' @@ to_tsquery(bla bla bla Commented Apr 19, 2017 at 10:29
  • You should include the definition of Message in the question, as it is a crucial part of the issue at hand (json vs. jsonb). Commented Apr 19, 2017 at 11:30
  • @IljaEverilä Done ;) Commented Apr 24, 2017 at 8:23

1 Answer


The problem lies in cast(Unicode) of a Postgresql json column. It simply CASTs the json to the text type underlying SQLAlchemy's Unicode, which in the case of Postgresql is VARCHAR. In other words, it produces a string representation of the JSON instead of extracting the text content. If your input contained escaped unicode code points, they are output as-is in this case. Given a simple Test model with a json column data:
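
The Test model isn't shown in the original answer, so the following is only a minimal sketch under assumed names: a plain declarative mapping with a Postgresql JSON column and a throwaway session setup for the interactive examples below.

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.dialects.postgresql import JSON
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Test(Base):
    __tablename__ = 'test'
    id = Column(Integer, primary_key=True)
    data = Column(JSON)  # plain json column, as discussed above

# Hypothetical connection; point this at any Postgresql database you can test against.
engine = create_engine('postgresql://localhost/test')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()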

In [7]: t = Test(data={'summary': 'Tämä on summary.'})

In [8]: session.add(t)

In [9]: session.commit()

In [11]: session.query(Test.data['summary'].cast(Unicode)).scalar()
Out[11]: '"T\\u00e4m\\u00e4 on summary."'

It should be evident why a match with unescaped unicode characters will fail. The correct way to extract the text content, unescaping escaped unicode, is to use astext, which uses the ->> operator in Postgresql:

In [13]: session.query(Test.data['summary'].astext).scalar()
Out[13]: 'Tämä on summary.'

Quoting the JSON functions and operators documentation:

Note: Many of these functions and operators will convert Unicode escapes in JSON strings to the appropriate single character. This is a non-issue if the input is type jsonb, because the conversion was already done; but for json input, this may result in throwing an error, as noted in Section 8.14.

So in your case:

Message.query.\
    filter(Message.content['summary'].astext.match(term))
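
If you want to verify what SQL this produces, you can compile the statement against the Postgresql dialect. This is just a sanity-check sketch; the exact parameter names and grouping in the output will vary.

from sqlalchemy.dialects import postgresql

query = Message.query.filter(Message.content['summary'].astext.match(term))
print(query.statement.compile(dialect=postgresql.dialect()))
# The WHERE clause should now use the ->> operator, roughly:
#   (bots_messages.content ->> %(content_1)s) @@ to_tsquery(%(param_1)s)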

Note that this only applies to the json type, not jsonb, because the json type does not convert unicode escapes on input. jsonb, on the other hand, converts all unicode escapes to equivalent ASCII or UTF-8 characters for storage. If our Test model contained a second column data2 of type jsonb, with the exact same input, the result would be:

In [11]: session.query(Test.data['summary'].cast(Unicode),
    ...:               Test.data2['summary'].cast(Unicode)).first()
Out[11]: ('"T\\u00e4m\\u00e4 on summary."', '"Tämä on summary"')

Still, you should use astext if you want text instead of a string representation of the JSON.


2 Comments

Thanks a lot for the very informative and helpful answer :)
You're welcome. The original explanation was lacking and a bit wrong, so I updated it to mention the differences between json and jsonb and unicode escapes.
