Why does PostgreSQL sort part case-sensitively and part case-insensitively?

Question

I cannot understand the behavior of PostgreSQL (v11.10). Here is what I do:

create temp table test (first_name text, last_name text);
insert into test values
  ('Hanna', 'Beat'),
  ('JOAN', 'BEET'),
  ('Mark', 'Bernstein'),
  ('ALFRED', 'DOE'),
  ('henry', 'doe'),
  ('Henry', 'Doe'),
  ('Dennis', 'Doe');
select last_name, first_name from test order by last_name, first_name;

This is what I get.

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 BEET      | JOAN
 Bernstein | Mark
 doe       | henry
 Doe       | Dennis
 Doe       | Henry
 DOE       | ALFRED
(7 rows)

It looks like the sorting of the first three names is case-insensitive, but for the last four it's case-sensitive. Why is that so?

In other words, if the sorting were case-sensitive, I would expect the following order:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 Bernstein | Mark
 BEET      | JOAN
 doe       | henry
 Doe       | Dennis
 Doe       | Henry
 DOE       | ALFRED
(7 rows)

and if it were case-insensitive, I would expect this:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 BEET      | JOAN
 Bernstein | Mark
 DOE       | ALFRED
 Doe       | Dennis
 doe       | henry
 Doe       | Henry
(7 rows)

What I get instead is a mixture of both, and that baffles me...

For completeness:

# show lc_collate; show lc_ctype;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

  lc_ctype   
-------------
 en_US.UTF-8
(1 row)

The sorting looks case-sensitive to me. Can you describe your doubt in more detail?
– Laurenz Albe, Commented Nov 23, 2022 at 10:30
@LaurenzAlbe For case-sensitive order, I would expect Beat, then Bernstein, then BEET. I edited the question to make it clearer.
– mbork, Commented Nov 23, 2022 at 10:36
It's fully case-sensitive, but the order is aAbBcCdDeEfF..., not abcdef...ABCDEF or ABCDEF...abcdef.
– Zegarek, Commented Nov 23, 2022 at 10:56
The linked references go further: if you delve deep enough, you'll see it can extend into diacritics and emojis: a <<< A << à <<< À < b <<< B
– Zegarek, Commented Nov 23, 2022 at 11:07

Laurenz Albe · Accepted Answer · 2022-11-23 11:06:42Z

Natural language collations are more complicated than you think. They use different comparison levels, where higher levels are used as tie-breakers when strings compare equal on a lower level. Typically, accents and case are ignored at the primary level. At the secondary level, accents are respected, but case is ignored. On the tertiary level, case and accents are respected.

So the strings Etat, état and etat would compare identical on the primary level. On the secondary level, état would be greater than the other two, which would be equal. On the tertiary level, etat would be less than Etat. All in all, we end up with

'etat' < 'Etat' < 'état'
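
To make the three levels concrete, here is a minimal sketch in Python (not PostgreSQL code). It is a simplified model, not the real Unicode Collation Algorithm or what glibc does internally: the function name collation_key is hypothetical, and it assumes that stripping combining marks gives the primary letters and that lower case sorts before upper case on the tertiary level.

```python
import unicodedata


def collation_key(text):
    """Simplified three-level sort key (an illustration, not the real UCA)."""
    base_chars = []  # letters with their accents removed
    accents = []     # combining marks attached to each base letter
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch          # attach the mark to the previous letter
        else:
            base_chars.append(ch)
            accents.append("")
    primary = "".join(base_chars).casefold()            # ignores case and accents
    secondary = tuple(accents)                          # accents only
    tertiary = tuple(c.isupper() for c in base_chars)   # case only, lower first
    # Tuples compare element by element, so each later level only breaks ties.
    return (primary, secondary, tertiary)


words = ["état", "Etat", "etat"]
ordered = sorted(words, key=collation_key)
print(ordered)  # ['etat', 'Etat', 'état']
```

All three words share the primary key "etat", so the accent decides that état comes last, and case only separates etat from Etat.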

It is kind of arbitrary that upper case characters are greater than lower case characters, and with ICU collations you can configure most of these aspects.

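In your example, BEET is less than Bernstein on the primary level: with case ignored, the comparison is effectively "beet" against "bernstein", and "e" sorts before "r" at the third letter. The strings already differ there, so case is never consulted. The four Doe rows, by contrast, are all equal on the primary and secondary levels, so case acts as the tie-breaker and you get doe < Doe < DOE. Both parts of your result follow the same rule.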
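
As a rough cross-check on the data from the question, the following Python sketch (again a model, not PostgreSQL) compares two orderings. Plain code-point order is approximately what a byte-wise collation such as "C" would give. The hypothetical helper two_level_key assumes ASCII-only names, so accents are not modelled, and assumes lower case sorts before upper case when letters tie.

```python
rows = [
    ("Hanna", "Beat"),
    ("JOAN", "BEET"),
    ("Mark", "Bernstein"),
    ("ALFRED", "DOE"),
    ("henry", "doe"),
    ("Henry", "Doe"),
    ("Dennis", "Doe"),
]  # (first_name, last_name), as inserted in the question


def two_level_key(text):
    # Letters first (case ignored); case is only a tie-breaker, lower first.
    return (text.casefold(), tuple(c.isupper() for c in text))


# Code-point order: every upper-case letter sorts before every lower-case one.
by_code_point = sorted(rows, key=lambda r: (r[1], r[0]))

# Level-based order: mirrors ORDER BY last_name, first_name in the question.
by_levels = sorted(rows, key=lambda r: (two_level_key(r[1]), two_level_key(r[0])))

for first_name, last_name in by_levels:
    print(f"{last_name:<10}| {first_name}")
```

Under these assumptions the level-based order reproduces the output shown in the question (Beat, BEET, Bernstein, doe, Doe, Doe, DOE), while code-point order starts with BEET and puts DOE before Doe.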
