How can Perl and the Unix sort utility order Unicode strings in the same sequence?

Question

I am trying to get Perl and the GNU/Linux sort(1) program to agree on how to sort Unicode strings. I'm running sort with LANG=en_US.UTF-8. In the Perl program I have tried the following methods:

use Unicode::Collate with $Collator = Unicode::Collate->new();
use Unicode::Collate::Locale with $Collator = Unicode::Collate->new(locale => $ENV{'LANG'});
use locale

Each one of them failed with the following errors (from the Perl side):

Input is not sorted: [----,] came after [($1]
Input is not sorted: [...] came after [&]
Input is not sorted: [($1] came after [1]

The only method that worked for me involved setting LC_ALL=C for sort, and using 8-bit characters in Perl. However, in this way Unicode strings are not properly ordered.
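
A minimal sketch of the workaround just described, assuming a POSIX shell with perl and sort on the PATH. The sample lines come from the error messages above; the temporary file and the variable names are illustrative only. With LC_ALL=C, sort(1) compares raw bytes, and Perl's cmp does the same when no locale pragma is active and the input is not decoded.

```shell
# Sample lines taken from the error messages above; the temp file is illustrative.
data=$(mktemp)
printf '%s\n' '1' '($1' '&' '...' '----,' > "$data"

# sort(1) in the C locale orders lines by raw byte value.
sort_order=$(LC_ALL=C sort "$data")

# Perl's default sort uses cmp, which compares bytes when the input is not
# decoded and "use locale" is absent. Chomp first so the trailing newline
# does not take part in the comparison (it would sort after a tab).
perl_order=$(perl -e 'chomp(my @l = <>); print "$_\n" for sort @l' "$data")

if [ "$sort_order" = "$perl_order" ]; then
    echo "orders agree"
else
    echo "orders differ"
fi
rm -f "$data"
```

Both sides agree here, but as noted above this is byte order, not a linguistically meaningful order for Unicode text.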

Are you calling sort properly? Unicode::Collate doesn't change the default behavior of sort; you have to use a custom comparison function.
– cjm, Commented Sep 14, 2014 at 17:55
The actual Perl code (for 8-bit characters) is at github.com/dspinellis/sgsh/blob/master/sgsh-merge-sum.pl. It is designed to merge the output of multiple sort | uniq -c invocations.
– Diomidis Spinellis, Commented Sep 14, 2014 at 18:36
That is to be expected. The precedence is LC_ALL first; if that is not defined, LC_COLLATE; and if that is not defined either, LANG. See pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html
– Diomidis Spinellis, Commented Sep 15, 2014 at 17:26

ikegami · Accepted Answer · 2015-05-28 14:43:24Z

Using Unicode::Collate or Unicode::Collate::Locale makes no sense. You're not trying to sort based on Unicode definitions; you're trying to sort based on your locale. That's what use locale; is for.

I don't know why you didn't get the desired order out of cmp under use locale;.

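You could process the decompressed files: expand each count produced by uniq -c back into that many copies of the line, then sort and count everything again.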

for q in file1.uniqc file2.uniqc ; do
   perl -ne's/^\s*(\d+) //; for $c (1..$1) { print }' "$q"
done | sort | uniq -c

It'll require more temporary storage, of course, but you'll get exactly the order you want.

I found a case where use locale; didn't cause Perl's sort/cmp to give the same result as the sort utility. Weird.

$ export LC_COLLATE=en_US.UTF-8

$ perl -Mlocale -e'print for sort { $a cmp $b } <>' data
(
($1
1

$ perl -MPOSIX=strcoll -e'print for sort { strcoll($a, $b) } <>' data
(
($1
1

$ sort data
(
1
($1

Truth be told, it's the sort utility that's weird.

In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.

edited May 28, 2015 at 14:43

answered Sep 14, 2014 at 19:08


11 Comments

Diomidis Spinellis Over a year ago

I'm benchmarking performance on a 20GB data set, so I can't afford a suboptimal solution. The case you describe is exactly the type of problem I'm facing. Note that I don't care a lot about the particular locale that will be used, as long as it works reasonably with Unicode strings (e.g. DUCET), and it works the same with sort(1) and Perl.

ikegami Over a year ago

Re "I'm benchmarking performance on a 20GB data set", so what was the result?

ikegami Over a year ago

Re "it works the same with sort(1) and Perl", is that really true? Do you actually need to use the sort utility?

ninjalj Over a year ago

Doesn't Perl use the UCA for sorting, while glibc uses ISO 14651?

ikegami Over a year ago

@ninjalj, I thought locale-based sorting was defined by system files? (I have heard about broken locales on machines many times.)


Peter H · Accepted Answer · 2021-09-11 02:20:46Z

I can't answer directly, but I had problems getting a simple script to sort Serbian Latin text correctly. I found https://www.perl.com/pub/2012/06/perlunicook-demo-of-unicode-collation-and-printing.html/, copied the author's setup (my actual processing is much simpler than his), and finally got the correct alphabetical sorting for that language and locale. The whole set of guides at https://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html/ covers about as much as anyone would need to know about Unicode linguistic sorting.

I assume you want to sort Greek. Here's a very simple version of what I copied and adapted from the guide, which sorts correctly.

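# Minimal setup for a trial sort
use utf8;      # this source file contains UTF-8 string literals
use v5.14;     # enables strict and the unicode_strings feature
use warnings;
use Unicode::Normalize;
use Unicode::Collate::Locale;
binmode(STDOUT, ':encoding(UTF-8)');   # avoid "Wide character" warnings on output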
my @words = qw{
        Η
        Ιθάκη
        σ'
        έδωσε
        το
        ωραίο
        ταξίδι.
        Χωρίς
        αυτήν
        δεν
        θάβγαινες
        στον
        δρόμο.
};
print "Unsorted: @words\n";
my $coll = Unicode::Collate::Locale->new( locale => "el_GR" );
my @sorted_words = $coll->sort(@words);
print "Sorted: @sorted_words\n";
