I cannot understand the behavior of PostgreSQL (v11.10). Here is what I do:

create temp table test (first_name text, last_name text);
insert into test values
  ('Hanna', 'Beat'),
  ('JOAN', 'BEET'),
  ('Mark', 'Bernstein'),
  ('ALFRED', 'DOE'),
  ('henry', 'doe'),
  ('Henry', 'Doe'),
  ('Dennis', 'Doe');
select last_name, first_name from test order by last_name, first_name;

This is what I get.

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 BEET      | JOAN
 Bernstein | Mark
 doe       | henry
 Doe       | Dennis
 Doe       | Henry
 DOE       | ALFRED
(7 rows)

It looks like the sorting of the first three names is case-insensitive, but for the last four it's case-sensitive. Why is that so?

In other words, if the sorting were case-sensitive, I would expect the following order:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 Bernstein | Mark
 BEET      | JOAN
 doe       | henry
 Doe       | Dennis
 Doe       | Henry
 DOE       | ALFRED
(7 rows)

and if it were case-insensitive, I would expect this:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 BEET      | JOAN
 Bernstein | Mark
 DOE       | ALFRED
 Doe       | Dennis
 doe       | henry
 Doe       | Henry
(7 rows)

What I get instead is a mixture of both, and that baffles me...

For completeness:

# show lc_collate; show lc_ctype;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

  lc_ctype   
-------------
 en_US.UTF-8
(1 row)

  • The sorting looks case sensitive to me. Can you describe your doubt in more detail? Commented Nov 23, 2022 at 10:30
  • @LaurenzAlbe for case-sensitive order, I would expect Beat, then Bernstein, then BEET. I edited the question to make it more clear. Commented Nov 23, 2022 at 10:36
  • It's fully case sensitive but the order is aAbBcCdDeEfF..., not abcdef...ABCDEF or ABCDEF...abcdef. Demo Commented Nov 23, 2022 at 10:56
  • Ah, that could make sense, thanks! Where is it documented? Commented Nov 23, 2022 at 10:59
  • Here which links to this - if you delve deep enough you'll see it can extend into diacritics and emojis: a <<< A << à <<< À < b <<< B Commented Nov 23, 2022 at 11:07

1 Answer

Natural language collations are more complicated than you think. They use several comparison levels, where each higher level only acts as a tie-breaker when strings compare equal on all lower levels. Typically, accents and case are ignored on the primary level. On the secondary level, accents are respected, but case is still ignored. On the tertiary level, both case and accents are respected.

So the strings Etat, état and etat would compare identical on the primary level. On the secondary level, état would be greater than the other two, which would be equal. On the tertiary level, etat would be less than Etat. All in all, we end up with

'etat' < 'Etat' < 'état'
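
To check this ordering yourself, a query along the following lines may help. It is only a sketch: it assumes the database's default collation is a linguistic one such as the en_US.UTF-8 shown in the question, and the alias s is made up for illustration. The second query uses the built-in "C" collation as a contrast, which compares byte values instead of applying comparison levels.

```sql
-- Sort the three example strings with the database's default collation.
-- With a linguistic collation such as en_US.UTF-8 this is expected to
-- return them in the order shown above.
SELECT s
FROM (VALUES ('état'), ('Etat'), ('etat')) AS t(s)
ORDER BY s;

-- For contrast: the "C" collation sorts by byte value, so upper case 'E'
-- sorts before lower case 'e', and the accented 'é' sorts last.
SELECT s
FROM (VALUES ('état'), ('Etat'), ('etat')) AS t(s)
ORDER BY s COLLATE "C";
```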

It is kind of arbitrary that upper case characters are greater than lower case characters, and with ICU collations you can configure most of these aspects.
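
As a rough illustration of that configurability, the sketch below defines an ICU collation that sorts upper case first. It assumes a PostgreSQL server built with ICU support (available from version 10) and reuses the test table from the question. The collation name upper_first is invented for this example, and the exact locale string that is accepted can depend on the ICU version.

```sql
-- Hypothetical collation name; "kf-upper" is the locale keyword that asks
-- ICU to sort upper case before lower case when breaking ties on case.
CREATE COLLATION upper_first (provider = icu, locale = 'en-u-kf-upper');

-- Apply it explicitly to both sort keys of the query from the question.
SELECT last_name, first_name
FROM test
ORDER BY last_name COLLATE upper_first,
         first_name COLLATE upper_first;
```

With such a collation, DOE would be expected to sort before Doe, and Doe before doe, while the primary-level order (Beat, BEET, Bernstein) stays the same.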

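In your example, case is ignored on the primary level, so BEET and Bernstein are compared as if both were written in the same case. BEET is less than Bernstein there, because e sorts before r in the third position, so that is the order in which the strings are sorted. The spellings doe, Doe and DOE, on the other hand, are equal on the primary and secondary levels, so the tertiary level decides, and lower case sorts first.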
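
If a plain byte-wise order is what you are after, one option is to request the "C" collation explicitly. The sketch below reuses the test table from the question and assumes an ASCII-compatible database encoding such as UTF-8. Note that it yields neither of the two orders listed in the question, because all upper case ASCII letters sort before all lower case ones.

```sql
-- Byte-wise comparison: no comparison levels, 'E' (0x45) < 'e' (0x65).
SELECT last_name, first_name
FROM test
ORDER BY last_name COLLATE "C",
         first_name COLLATE "C";
-- Expected row order:
--   BEET / JOAN, Beat / Hanna, Bernstein / Mark,
--   DOE / ALFRED, Doe / Dennis, Doe / Henry, doe / henry
```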
