I have a db query like so which I am executing in Python on a Postgres database:

"Select * from my_tbl where big_string like '%Almodóvar%'"

However, in the column I am searching, Almodóvar is stored as 'Almod\u00f3var', so the query returns nothing.

What can I do to make the two strings match up? Would prefer to work with Almodóvar on the Python side rather than the column in the database but I am flexible.

Additional info prompted by comments:

The database uses UTF-8. The field I am querying on is acquired from an external API. The data was retrieved RESTfully as json and then inserted into a text field of the database after a json.dump.

Because the data includes a lot of foreign names and characters, working with it has been a series of encoding-related headaches. If there is a silver bullet for making this data play nice with Python, I would be very grateful to know what that is.

UPDATE 2:

It looks like JSON encoding is what created my quandary.

print json.dumps("Almodóvar")

yields

"Almod\u00f3var"

which is what I see when I look at the raw data. However, when I use json.dumps to construct this:

"Select * from my_tbl where big_string like '%Almod\u00f3var%'"

the query still yields nothing. I'm stumped.
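The mismatch is between the six literal characters `\u00f3` that json.dumps wrote into the text column and the single character ó that Python substitutes for the same escape sequence in a string literal. A minimal sketch of the difference:

```python
import json

# What json.dumps wrote into the text column: a literal backslash escape.
stored = json.dumps("Almodóvar")
print(stored)                       # "Almod\u00f3var"

# In Python source, "\u00f3" is interpreted as the single character ó,
# so the LIKE pattern never contains the literal backslash sequence.
print("Almodóvar" in stored)        # False: one ó vs. six literal characters
print("Almod\\u00f3var" in stored)  # True: matching the escaped form works
```

So a pattern built from the plain string can never match rows that hold the escaped form; either the stored data or the pattern has to change representation.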

3 Answers


from help(json.dumps):

If ``ensure_ascii`` is false, all non-ASCII characters are not escaped, and
the return value may be a ``unicode`` instance. See ``dump`` for details.

from help(json.loads):

If ``s`` is a ``str`` instance and is encoded with an ASCII based encoding
other than utf-8 (e.g. latin-1) then an appropriate ``encoding`` name
must be specified. Encodings that are not ASCII based (such as UCS-2)
are not allowed and should be decoded to ``unicode`` first.

so try something like

>>> js = json.dumps("Almodóvar", ensure_ascii=False)  
>>> res = json.loads(js, encoding="utf-8")
>>> print res
Almodóvar
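The snippet above is Python 2. In Python 3 the `encoding` argument to `json.loads` no longer exists (it was removed in 3.9), but `ensure_ascii=False` works the same way and keeps the round trip lossless:

```python
import json

# ensure_ascii=False keeps ó as-is instead of escaping it to \u00f3.
js = json.dumps("Almodóvar", ensure_ascii=False)
print(js)              # "Almodóvar"
print(json.loads(js))  # Almodóvar
```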

2 Comments

This is helpful. Looks like the best thing to do would be to update the dumps in the database with ensure_ascii=False and see how that goes. Thanks for the detailed explanation. This might solve a lot of my problems. First mistake was probably loading json without the encoding parameter. Will test this against my current problem and accept this answer if it works.
I created a new column with data loaded and dumped per your post and was able to execute the search mentioned in my question. Thanks!

Your issue seems to come from a step before your query: the point at which you retrieved the data from the web service. It could be that:

  • The encoding is not set to UTF-8 during your communication with the Web service.
  • The encoding from tmdb.org side is not UTF-8 (I'm not sure).

I would look into these two points, starting with the second possibility.
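One way to rule out the first point is to decode the response bytes explicitly as UTF-8 before parsing. A sketch using a hypothetical response body (real code would pass the raw bytes of the HTTP response):

```python
import json

# Hypothetical raw bytes from the API, containing a literal \u escape.
body = b'{"title": "Almod\\u00f3var"}'

# Decode explicitly as UTF-8, then let json.loads resolve the \u escape.
data = json.loads(body.decode("utf-8"))
print(data["title"])  # Almodóvar
```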

1 Comment

Yeah, I think my mistake was loading the data from the service without setting the encoding parameter. It's been a series of encoding problems since.

Set the character encoding of your Postgres tables to UTF-8; then it will integrate smoothly with Python, without any need to convert back and forth. Your problem looks like you are using two different encodings for your Python code and your DB.

Edit: Almod\u00f3var looks to me like windows code page 1252.

2 Comments

Out of curiosity, if he has the values already inserted and represented as Almod\u00f3var for example, would changing the DB's encoding change the representation of these previously inserted values to Almodóvar. Or does he have to perform some handling?
I have confirmed that the encoding for the database is already UTF-8. Will update my question with more info about the data.
