While trying to answer this question about SQL sorting, I noticed a sort order I did not expect:

$ export LC_ALL=en_US.UTF-8  
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700A Grouped
T-700A Halved
T-700 Whole
$ 

Why is "700 A" sorted above "700A ", while "700A " is sorted above "700 W"? I would expect a space to sort before A consistently, regardless of the characters that follow it.

It works fine if you use the C locale:

$ export LC_ALL=C
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700 Whole
T-700A Grouped
T-700A Halved
$ 
  • It appears that leading whitespace is ignored and trailing whitespace isn't. Commented Dec 30, 2015 at 23:26
  • @MichaelHomer: You're right, my mistake, typoed "A" for "A<space>". I'll update the question Commented Dec 30, 2015 at 23:38

1 Answer

Sorting is done in multiple passes. Each character has three (or sometimes more) weights assigned to it. Let's say for this example the weights are

         wt#1 wt#2 wt#3
space = [0000.0020.0002]
A     = [1BC2.0020.0008]

To create the sort key, the nonzero weights of a string's characters are concatenated one level at a time: first all the level-1 weights, then all the level-2 weights, then all the level-3 weights. A zero weight contributes nothing, which is why the key for " A" starts with the weight of A rather than that of the space. So:

       wt#1   -- wt#2 ---   -- wt#3 ---
" A" = 1BC2   0020   0020   0002   0008
       A      sp     A      sp     A

       wt#1   wt#2   wt#3
"A"  = 1BC2   0020   0008
       A      A      A

       wt#1   -- wt#2 ---   -- wt#3 ---
"A " = 1BC2   0020   0020   0008   0002
       A      A      sp     A      sp

If you sort these arrays you get the order you see:

       1BC2   0020   0008               => "A"
       1BC2   0020   0020   0002   0008 => " A"
       1BC2   0020   0020   0008   0002 => "A "
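
The key construction above can be sketched in Python. This is only an illustration built on the two example weights from this answer; `WEIGHTS` and `sort_key` are hypothetical names, and real collation code (which follows the Unicode Collation Algorithm more closely) is considerably more involved.

```python
# Example weights from this answer: (level 1, level 2, level 3).
WEIGHTS = {
    " ": (0x0000, 0x0020, 0x0002),
    "A": (0x1BC2, 0x0020, 0x0008),
}

def sort_key(s):
    """Concatenate the nonzero weights of s, one level at a time."""
    key = []
    for level in range(3):
        for ch in s:
            weight = WEIGHTS[ch][level]
            if weight != 0:  # a zero weight adds nothing to the key
                key.append(weight)
    return key

# Sort the three example strings by their keys and show each key in hex.
for s in sorted(["A ", " A", "A"], key=sort_key):
    print(repr(s), " ".join(f"{w:04X}" for w in sort_key(s)))
```

Python compares the lists element by element, so "A" comes first because its third entry (0008) is smaller than the 0020 found at that position in the other two keys.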

This is a simplification of what actually happens; see the Unicode Collation Algorithm for more details. The above example weights are actually from the standard table, with some details omitted.
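
Applied to the four lines from the question, the same scheme appears to reproduce the order `sort` printed. In the sketch below only the weights for space and `A` come from the example above. Every other weight is a made-up placeholder, chosen so that digits sort before letters, letters sort alphabetically, and case differs only at the third level. Treating the hyphen like the space is also an assumption. The `weights` and `sort_key` helpers are hypothetical names.

```python
import string

def weights(ch):
    """Return (level 1, level 2, level 3) weights for one ASCII character."""
    if ch in " -":
        # Space is from the answer's example; treating "-" the same is an assumption.
        return (0x0000, 0x0020, 0x0002)
    if ch in string.digits:
        # Placeholder values that put digits before letters.
        return (0x1000 + int(ch), 0x0020, 0x0002)
    if ch in string.ascii_letters:
        # "A" = 1BC2 is from the answer; the other level-1 values are placeholders.
        primary = 0x1BC2 + string.ascii_uppercase.index(ch.upper())
        tertiary = 0x0008 if ch.isupper() else 0x0002
        return (primary, 0x0020, tertiary)
    raise ValueError(f"no example weight for {ch!r}")

def sort_key(s):
    """One tuple of nonzero weights per level; levels are compared in order."""
    return tuple(
        tuple(weights(ch)[level] for ch in s if weights(ch)[level] != 0)
        for level in range(3)
    )

lines = ["T-700A Grouped", "T-700 AGrouped", "T-700A Halved", "T-700 Whole"]

print("multi-level weights:")
for line in sorted(lines, key=sort_key):
    print(" ", line)

print("plain code point order (like LC_ALL=C for ASCII):")
for line in sorted(lines):
    print(" ", line)
```

With these weights, "T-700 AGrouped" and "T-700A Grouped" tie on levels 1 and 2 and are separated only at level 3, where the space (0002) meets the A (0008). "T-700 Whole" sorts last because W is already greater than A at level 1, where the space does not count. Keeping each level in its own tuple is a choice of this sketch, so that a longer first level is never compared against second-level weights.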

  • Very helpful. The fact that it sorts A-Space above Space-A above W-Space cannot be explained by a constant code point for space. As you suggest it is probably multiple passes, which section 1.6 seems to explain. Commented Dec 31, 2015 at 0:23
  • I am a bit lost here: why does "A" become A A A, why does " A" become A sp A sp A, and why does "A " become A A sp A sp? Why do those A's have weight 1BC2? And why does sp sometimes have weight 0020 and sometimes 0002? Please explain a bit more for me :) Commented Jan 27, 2021 at 15:12
