1

I have this query in Postgres where I'm ordering a small amount of rows according to a varchar field. There seems to be an error in ordering UTF8 strings in Postgres:

For example:

'W' in UTF-8 is 87, while 'g' is 103, but running SELECT 'W' < 'g'; will return false while running SELECT convert_to('W', 'SQL_ASCII') < convert_to('g', 'SQL_ASCII')'; will return true.

The collation is en_US.UTF-8.

Is there a good explanation for this behavior? And how to avoid it?

2 Answers 2

1

The ordering is not on the Unicode code points, but defined by the collation. And in UTF-8 we have 'A'<'a'<'B'<'b' etc.

Most people (except coders) expect this ordering. But feel free to collate with ASCII where you need it.

Sign up to request clarification or add additional context in comments.

Comments

1

This will show the ascii collation ordering of some of the first unicode code points, if you are using the utf8 encoding:

select s, chr(s) from generate_series(32, 255) s order by chr(s) collate "C";

Now the same for the pt_BR (brazilian portuguese) collation:

select s, chr(s) from generate_series(32, 255) s order by chr(s) collate "pt_BR";

What you call collation (en_US.UTF-8) is the collation before the dot and the encoding after the dot.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.