25

I am trying to get Perl and the GNU/Linux sort(1) program to agree on how to sort Unicode strings. I'm running sort with LANG=en_US.UTF-8. In the Perl program I have tried the following methods:

Each one of them failed with the following errors (from the Perl side):

  • Input is not sorted: [----,] came after [($1]
  • Input is not sorted: [...] came after [&]
  • Input is not sorted: [($1] came after [1]

The only method that worked for me involved setting LC_ALL=C for sort, and using 8-bit characters in Perl. However, in this way Unicode strings are not properly ordered.

18
  • 2
    Are you calling sort properly? Unicode::Collate doesn't change the default behavior of sort; you have to use a custom comparison function. Commented Sep 14, 2014 at 17:55
  • 2
    The actual Perl code (for 8-bit characters) is at github.com/dspinellis/sgsh/blob/master/sgsh-merge-sum.pl. It is designed to merge the output of multiple sort | uniq -c invocations. Commented Sep 14, 2014 at 18:36
  • 8
    Note that sort uses LC_COLLATE, not LANG. Commented Sep 14, 2014 at 20:03
  • 2
    See also: stackoverflow.com/questions/20226851/… Commented Sep 14, 2014 at 22:54
  • 3
    That is to be expected. The precedence is LC_ALL, if not defined LC_COLLATE, if not defined LANG. See pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html Commented Sep 15, 2014 at 17:26
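
Rather than reasoning about the variables, one way to check which collation setting is in effect is to ask the locale utility. The sketch below assumes a system that ships locale(1) (glibc does); the variable name collate_line is made up for the example. It sets LANG and LC_COLLATE to different values and prints the LC_COLLATE line that sort would see.

```shell
# LC_ALL is unset inside the command substitution (a subshell) so that it
# cannot mask the other two variables.
collate_line=$(
  unset LC_ALL
  # LANG and LC_COLLATE deliberately disagree; stderr is silenced in case
  # en_US.UTF-8 is not installed on this machine.
  LANG=en_US.UTF-8 LC_COLLATE=C locale 2>/dev/null | grep '^LC_COLLATE='
)
echo "$collate_line"
```

Running the same check with the environment that sort is actually started in shows whether LANG alone is deciding the order.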

2 Answers

4

Using Unicode::Collate or Unicode::Collate::Locale makes no sense. You're not trying to sort based on Unicode definitions, you're trying to sort based on your locale. That's what use locale; is for.

I don't know why you didn't get the desired order out of cmp under use locale;.

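You could process the decompressed files, i.e. expand each uniq -c count back into that many repeated lines and let sort and uniq -c redo the work: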

for q in file1.uniqc file2.uniqc ; do
   perl -ne's/^\s*(\d+) //; for $c (1..$1) { print }' "$q"
done | sort | uniq -c

It'll require more temporary storage, of course, but you'll get exactly the order you want.


I found a case where use locale; didn't cause Perl's sort/cmp to give the same result as the sort utility. Weird.

$ export LC_COLLATE=en_US.UTF-8

$ perl -Mlocale -e'print for sort { $a cmp $b } <>' data
(
($1
1

$ perl -MPOSIX=strcoll -e'print for sort { strcoll($a, $b) } <>' data
(
($1
1

$ sort data
(
1
($1

Truth be told, it's the sort utility that's weird.


In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.


11 Comments

I'm benchmarking performance on a 20GB data set, so I can't afford a suboptimal solution. The case you describe is exactly the type of problem I'm facing. Note that I don't care a lot about the particular locale that will be used, as long as it works reasonably with Unicode strings (e.g. DUCET), and it works the same with sort(1) and Perl.
Re "I'm benchmarking performance on a 20GB data set": so what was the result?
Re "it works the same with sort(1) and Perl": is that really true? Do you actually need to use the sort utility?
Doesn't Perl use the UCA for sorting, while glibc uses ISO 14651?
@ninjalj, I thought locale-based sorting was defined by system files? (I heard about broken locale on machines many times.)
1

I can't answer directly, but I had problems getting a simple script to sort Serbian Latin text correctly. I found https://www.perl.com/pub/2012/06/perlunicook-demo-of-unicode-collation-and-printing.html/, copied the author's setup (my actual processing is much simpler), and finally got the correct alphabetic sorting for that language and locale. The whole set of guides, starting at https://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html/, covers about as much as anyone would need to know about Unicode linguistic sorting.

I assume you want to sort Greek. Here's a very simple version of what I copied and adapted from the guide, which sorts correctly.

# min required setup for trial sort
use utf8;
use v5.14; # for locale sorting and unicode_strings
use open qw(:std :encoding(UTF-8)); # print as UTF-8, avoids "Wide character" warnings
use Unicode::Normalize;
use Unicode::Collate::Locale;
my @words = qw{
        Η
        Ιθάκη
        σ'
        έδωσε
        το
        ωραίο
        ταξίδι.
        Χωρίς
        αυτήν
        δεν
        θάβγαινες
        στον
        δρόμο.
};
print "Unsorted: @words\n";
my $coll = Unicode::Collate::Locale->new( locale => "el_GR" );
my @sorted_words = $coll->sort(@words);
print "Sorted: @sorted_words\n";
