I have a db query like so which I am executing in Python on a Postgres database:

"Select * from my_tbl where big_string like '%Almodóvar%'"

However, in the column I am searching, Almodóvar is stored as 'Almod\u00f3var', so the query returns nothing.

What can I do to make the two strings match up? Would prefer to work with Almodóvar on the Python side rather than the column in the database but I am flexible.

Additional info prompted by comments:

The database uses UTF-8. The field I am querying on is acquired from an external API. The data was retrieved RESTfully as json and then inserted into a text field of the database after a json.dump.

Because the data includes a lot of foreign names and characters, working with it has been a series of encoding-related headaches. If there is a silver bullet for making this data play nice with Python, I would be very grateful to know what that is.

UPDATE 2:

It looks like JSON encoding is what created my quandary.

print json.dumps("Almodóvar")

yields

"Almod\u00f3var"

which is what I see when I look at the raw data. However, when I use json.dumps to construct this:

"Select * from my_tbl where big_string like '%Almod\u00f3var%'"

the query still yields nothing. I'm stumped.
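The mismatch is between the six literal characters `\u00f3` that json.dumps wrote into the text column and the single character ó that Python substitutes for the same escape sequence in a string literal. A minimal sketch of the difference:

```python
import json

# What json.dumps wrote into the text column: a literal backslash escape.
stored = json.dumps("Almodóvar")
print(stored)                       # "Almod\u00f3var"

# In Python source, "\u00f3" is interpreted as the single character ó,
# so the LIKE pattern never contains the literal backslash sequence.
print("Almodóvar" in stored)        # False: one ó vs. six literal characters
print("Almod\\u00f3var" in stored)  # True: matching the escaped form works
```

So a pattern built from the plain string can never match rows that hold the escaped form; either the stored data or the pattern has to change representation.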

3 Answers


from help(json.dumps):

If ``ensure_ascii`` is false, all non-ASCII characters are not escaped, and
the return value may be a ``unicode`` instance. See ``dump`` for details.

from help(json.loads):

If ``s`` is a ``str`` instance and is encoded with an ASCII based encoding
other than utf-8 (e.g. latin-1) then an appropriate ``encoding`` name
must be specified. Encodings that are not ASCII based (such as UCS-2)
are not allowed and should be decoded to ``unicode`` first.

so try something like

>>> js = json.dumps("Almodóvar", ensure_ascii=False)  
>>> res = json.loads(js, encoding="utf-8")
>>> print res
Almodóvar
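The snippet above is Python 2. In Python 3 the `encoding` argument to `json.loads` no longer exists (it was removed in 3.9), but `ensure_ascii=False` works the same way and keeps the round trip lossless:

```python
import json

# ensure_ascii=False keeps ó as-is instead of escaping it to \u00f3.
js = json.dumps("Almodóvar", ensure_ascii=False)
print(js)              # "Almodóvar"
print(json.loads(js))  # Almodóvar
```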

2 Comments

This is helpful. Looks like the best thing to do would be to update the dumps in the database with ensure_ascii=False and see how that goes. Thanks for the detailed explanation. This might solve a lot of my problems. First mistake was probably loading json without the encoding parameter. Will test this against my current problem and accept this answer if it works.
I created a new column with data loaded and dumped per your post and was able to execute the search mentioned in my question. Thanks!

Your issue seems to come from a step before your query: the point at which you retrieved the data from the web service. It could be that:

  • The encoding is not set to UTF-8 during your communication with the Web service.
  • The encoding from tmdb.org side is not UTF-8 (I'm not sure).

I would look into these two points, starting with the second possibility.
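One way to rule out the first point is to decode the response bytes explicitly as UTF-8 before parsing. A sketch using a hypothetical response body (real code would pass the raw bytes of the HTTP response):

```python
import json

# Hypothetical raw bytes from the API, containing a literal \u escape.
body = b'{"title": "Almod\\u00f3var"}'

# Decode explicitly as UTF-8, then let json.loads resolve the \u escape.
data = json.loads(body.decode("utf-8"))
print(data["title"])  # Almodóvar
```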

1 Comment

Yeah, I think my mistake was loading the data from the service without setting the encoding parameter. It's been a series of encoding problems since.

Set the character encoding of your Postgres tables to UTF-8; then it will integrate smoothly with Python, without any need to convert back and forth. Your problem looks like you are using two different encodings for your Python code and your DB.

Edit: Almod\u00f3var looks to me like windows code page 1252.

2 Comments

Out of curiosity, if he has the values already inserted and represented as Almod\u00f3var for example, would changing the DB's encoding change the representation of these previously inserted values to Almodóvar. Or does he have to perform some handling?
I have confirmed that the encoding for the database is already UTF-8. Will update my question with more info about the data.
