This question is essentially the same as an existing question about Java, except that it concerns Python.

I wish to query rows from a PostgreSQL database ordered by the e-mail address column and then perform operations in Python that rely on that ordering.

The database I'm querying uses the en_US.UTF8 collation, which, in a few tests, shows some peculiar behavior with respect to the @ symbol in e-mail addresses:

mydb=> SELECT '0'  < '@';
 ?column? 
----------
 f
(1 row)

mydb=> SELECT '0'  < '@0';
 ?column? 
----------
 t
(1 row)

This answer suggests that some collations may ignore the @ symbol, but if that were the case here, I'd have expected an f from the second query, since '0' and '@0' would then compare as equal.

Although Python supplies a locale module, it behaves inconsistently on some platforms, so it does not seem usable for this purpose.

Based on that report, I tried the recommendation to use the PyICU package, which seemed promising:

>>> import icu
>>> collator = icu.Collator.createInstance()
>>> collator.getLocale()
<Locale: en_US>
>>> collator.getSortKey('0') < collator.getSortKey('@')
False
>>> collator.getSortKey('0') < collator.getSortKey('@0')
False

But as you can see, the last comparison yields a different order than PostgreSQL does.

I've tried specifying a different collation for the query, something like:

SELECT email COLLATE posix FROM mytable ORDER by email;

But that results in an error: collation "posix" for encoding "UTF8" does not exist. I also tried the collation "en-us-x-icu", but that does not exist either.

Is there any way to reliably query a column of e-mail addresses from PostgreSQL in an order upon which a Python program could rely, either by adapting the collation of the query or by honoring the default collation in Python?

  • Could you sort them in Python? Commented Jan 6, 2019 at 16:50
  • If it's the same but in Python, have you tried converting the answer with an implementation to Python? What happened? Commented Jan 6, 2019 at 16:53
  • I can't sort them in Python without significant performance degradation. There are millions of rows in multiple databases that would have to be loaded into memory. I do want to rely on the sort order optimizations afforded by the database. Commented Jan 6, 2019 at 17:56
  • The Java answers were largely Java-specific, although the accepted answers turn out to be the same - use the "C" collation. I had tried "posix" without success, so had not held hope that "C" would help. Commented Jan 6, 2019 at 17:59

1 Answer

Use collate "C" in Postgres:

with test(test) as (
values ('@'), ('@0'), ('0')
)

select test
from test
order by test collate "C"

 test 
------
 0
 @
 @0
(3 rows)

Python:

>>> test = ['@', '@0', '0']
>>> test.sort()
>>> test
['0', '@', '@0']    
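
Why the two orderings agree: the "C" collation compares the stored bytes, Python compares str values code point by code point, and UTF-8 preserves code point order. The sketch below checks this on a few sample strings and then merges two already-ordered streams lazily, which is the kind of operation that depends on the database order matching Python's. The sample values and the names stream_a and stream_b are hypothetical stand-ins for rows returned by ORDER BY email COLLATE "C"; it assumes a UTF8-encoded database.

```python
import heapq

# Hypothetical sample values standing in for e-mail addresses from the database.
emails = ["bob@example.com", "@", "@0", "0", "Alice@example.com",
          "zoë@example.com", "a_b@example.com"]

# Python's default ordering: str values compared code point by code point.
by_code_point = sorted(emails)

# Byte-wise ordering of the UTF-8 encoding, which is what COLLATE "C"
# is assumed to use in a UTF8 database.
by_utf8_bytes = sorted(emails, key=lambda s: s.encode("utf-8"))

# UTF-8 preserves code point order, so the two orderings are identical.
assert by_code_point == by_utf8_bytes

# Two streams that are each already in that order (for example, one per
# database) can be merged without loading everything into memory.
stream_a = iter(["0", "@0", "carol@example.com"])
stream_b = iter(["@", "bob@example.com"])
merged = list(heapq.merge(stream_a, stream_b))
print(merged)
```

heapq.merge only produces correct output when each input is already sorted under Python's own comparison, which is why the query needs the "C" collation rather than en_US.UTF8.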

4 Comments

Can you please convert this code into Java? Thanks.
It's simple sorting, so I don't think there is anything to convert. In any case, I don't use Java, unfortunately.
But this code sorts the way Postgres does, in Python, right?
Right. But the key is collate "C" because default sorting in Postgres is different.
