How to build a Python comparator that sorts strings the way PostgreSQL does?

Question

This question is essentially the same as an earlier question about Java, except that it concerns Python.

I wish to query rows from a PostgreSQL database ordered by the e-mail address column and then perform operations in Python that rely on that ordering.

The database I'm querying uses the en_US.UTF8 collation, which, based on a few tests, behaves peculiarly with respect to the @ symbol in e-mail addresses:

mydb=> SELECT '0'  < '@';
 ?column? 
----------
 f
(1 row)

mydb=> SELECT '0'  < '@0';
 ?column? 
----------
 t
(1 row)

This answer suggests that an @ symbol may be ignored by some collations, but if that were the case here, I'd have expected a t from the second query.

Although Python supplies a locale module, that module has inconsistent behavior on some platforms, so I seem to be unable to use that module for this purpose.

Based on that report, I tried the recommendation to use the PyICU package, which seemed promising:

>>> import icu
>>> collator = icu.Collator.createInstance()
>>> collator.getLocale()
<Locale: en_US>
>>> collator.getSortKey('0') < collator.getSortKey('@')
False
>>> collator.getSortKey('0') < collator.getSortKey('@0')
False

But as you can see, the last comparison yields a different order than Postgres does.

I've tried specifying a different collation for the query, something like:

SELECT email COLLATE posix FROM mytable ORDER BY email;

But that results in an error: collation "posix" for encoding "UTF8" does not exist. I also tried the collation "en-us-x-icu", but that does not exist either.

Is there any way to reliably query a column of e-mail addresses from PostgreSQL in an order upon which a Python program could rely, either by adapting the collation of the query or by honoring the default collation in Python?

If it's the same but in Python, have you tried converting the answer with an implementation to Python? What happened?
– jonrsharpe, Commented Jan 6, 2019 at 16:53
I can't sort them in Python without significant performance degradation. There are millions of rows in multiple databases that would have to be loaded into memory. I want to rely on the sort-order optimizations afforded by the database.
– Jason R. Coombs, Commented Jan 6, 2019 at 17:56
The Java answers were largely Java-specific, although the accepted answer turns out to be the same: use the "C" collation. I had tried "posix" without success, so I had not held out hope that "C" would help.
– Jason R. Coombs, Commented Jan 6, 2019 at 17:59

klin · Accepted Answer · 2019-01-06 17:22:49Z

Use collate "C" in Postgres:

with test(test) as (
values ('@'), ('@0'), ('0')
)

select test
from test
order by test collate "C"

 test 
------
 0
 @
 @0
(3 rows)

Python:

>>> test = ['@', '@0', '0']
>>> test.sort()
>>> test
['0', '@', '@0']
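
As far as I understand it, the "C" collation compares the raw bytes of each string, and for a UTF8-encoded database that gives the same order as Python's default code-point comparison of str values. The sketch below is only an illustration under that assumption; the names c_collation_key and c_collation_cmp and the sample addresses are made up for the example, not part of any library:

```python
from functools import cmp_to_key


def c_collation_key(s: str) -> bytes:
    """Sort key that mimics a byte-wise comparison of UTF-8 text."""
    return s.encode("utf-8")


def c_collation_cmp(a: str, b: str) -> int:
    """Three-way comparator built on the same byte-wise rule."""
    ka, kb = c_collation_key(a), c_collation_key(b)
    return (ka > kb) - (ka < kb)


# Hypothetical sample data, including mixed case and a non-ASCII character.
emails = ["@", "@0", "0", "bob@example.com", "Bob@example.com", "é@example.com"]

by_key = sorted(emails, key=c_collation_key)
by_cmp = sorted(emails, key=cmp_to_key(c_collation_cmp))
by_default = sorted(emails)

# All three approaches should agree: bytes of UTF-8 text sort in
# code-point order, which is also how Python compares str objects.
print(by_key == by_cmp == by_default)
print(by_key)
```

If that assumption holds for your database, plain sorted() or the < operator is already the comparator you need, and the explicit key is only useful to make the intent obvious.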

4 Comments

Abdurehman Siddiqui Over a year ago

Can you please convert this code into java? thanks

klin Over a year ago

It's simple sorting, so I don't think there is anything to convert. In any case, I don't use Java, unfortunately.

Abdurehman Siddiqui Over a year ago

But this code sorts the way Postgres does, in Python, right?

klin Over a year ago

Right. But the key is collate "C", because the default sorting in Postgres is different.
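
For the use case in the question (millions of rows from several databases, without loading everything into memory), one possible approach is to let each database do the ordering with collate "C" and then merge the streams lazily in Python. This is a rough sketch, not a tested production recipe: assert_sorted is a hypothetical helper, and the two lists stand in for database cursors that already return rows ordered with collate "C":

```python
import heapq
from typing import Iterable, Iterator


def assert_sorted(rows: Iterable[str]) -> Iterator[str]:
    """Yield rows unchanged, raising if one arrives out of code-point order."""
    previous = None
    for row in rows:
        if previous is not None and row < previous:
            raise ValueError(f"out of order: {previous!r} before {row!r}")
        previous = row
        yield row


# Stand-ins for two cursors, each assumed to be ordered with COLLATE "C".
db_a = iter(["0@example.com", "@", "bob@example.com"])
db_b = iter(["@0", "alice@example.com"])

# heapq.merge consumes its inputs lazily, so only one pending row per
# stream is held at a time; list() is used here only to show the result.
merged = list(heapq.merge(assert_sorted(db_a), assert_sorted(db_b)))
print(merged)
```

The check inside assert_sorted is cheap and fails loudly if a query was accidentally run with the default collation, which would otherwise make the merge silently produce a wrong order.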
