Using Perl, how can I sort an array using the value of a number inside each array element?

Question

Let's say I have an array, @theArr, which holds 1,000 or so elements such as the following:

01  '12 16 sj.1012804p1012831.93.gz'
02  '12 16 sj.1012832p1012859.94.gz'
03  '12 16 sj.1012860p1012887.95.gz'
04  '12 16 sj.1012888p1012915.96.gz'
05  '12 16 sj.1012916p1012943.97.gz'
06  '12 16 sj.875352p875407.01.gz'
07  '12 16 sj.875408p875435.02.gz'
08  '12 16 sj.875436p875535.03.gz'
09  '12 16 sj.875536p875575.04.gz'
10  '12 16 sj.875576p875603.05.gz'
11  '12 16 sj.875604p875631.06.gz'
12  '12 16 sj.875632p875659.07.gz'
13  '12 16 sj.875660p875687.08.gz'
14  '12 16 sj.875688p875715.09.gz'
15  '12 16 sj.875716p875743.10.gz'
...

If my first set of numbers (between the 'sj.' and the 'p') was always 6 digits, I wouldn't have a problem. But when the numbers roll over into 7 digits, the default sort stops working, because the larger 7-digit numbers come before the smaller 6-digit numbers.

Is there a way to tell Perl to sort by that number inside the string in each array element?
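The ordering problem described above can be reproduced with a few of the sample entries. With no comparison block, sort compares elements as strings, so an entry starting 'sj.1...' lands before one starting 'sj.8...'. This is only a minimal sketch; the three-element array and the name @default are illustrative, not from the original data set.

```perl
use strict;
use warnings;

# Three entries taken from the sample above, deliberately out of order.
my @theArr = (
    '12 16 sj.875352p875407.01.gz',
    '12 16 sj.1012804p1012831.93.gz',
    '12 16 sj.875408p875435.02.gz',
);

# Default sort compares strings character by character,
# so '1' (from 1012804) sorts before '8' (from 875352).
my @default = sort @theArr;

print "$_\n" for @default;
```

The 7-digit entry is printed first, which is the behaviour the question is asking how to avoid.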

Chas. Owens · Accepted Answer · 2009-05-01 01:16:09Z

18

Looks like you need a Schwartzian Transform:

#!/usr/bin/perl

use strict;
use warnings;

my @a = <DATA>;

print 
    map  { $_->[1] }                #get the original value back
    sort { $a->[0] <=> $b->[0] }    #sort arrayrefs numerically on the sort value
    map  { /sj\.(.*?)p/; [$1, $_] } #build arrayref of the sort value and orig
    @a;

__DATA__
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
12 16 sj.875604p875631.06.gz
12 16 sj.875632p875659.07.gz
12 16 sj.875660p875687.08.gz
12 16 sj.875688p875715.09.gz
12 16 sj.875716p875743.10.gz

edited May 1, 2009 at 1:16

answered May 1, 2009 at 1:06

Chas. Owens


5 Comments

Matt K Over a year ago

Your regex is wrong. The number part stops on the "p", not the ., so your regex should be /sj\.(\d+)p/

Chas. Owens Over a year ago

Do not use \d to mean [0-9]. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).
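The claim in this comment can be checked directly. The sketch below tests the single character U+1815 against \d and against [0-9]. The third test uses the /a modifier, which is not mentioned in this thread: it was added in Perl 5.14 and restricts \d to the ASCII range, so that line assumes a Perl at least that new. The variable names are illustrative.

```perl
use strict;
use warnings;

my $mongolian_five = "\x{1815}";    # MONGOLIAN DIGIT FIVE

# \d matches any character with the Unicode digit property.
my $d_matches = $mongolian_five =~ /\A\d\z/ ? 1 : 0;

# An explicit class matches only the ten ASCII digits.
my $class_matches = $mongolian_five =~ /\A[0-9]\z/ ? 1 : 0;

# Assumption: Perl 5.14 or later, where /a limits \d to ASCII.
my $ascii_matches = $mongolian_five =~ /\A\d\z/a ? 1 : 0;

printf "\\d: %d, [0-9]: %d, \\d with /a: %d\n",
    $d_matches, $class_matches, $ascii_matches;
```

Only the plain \d test should report a match, which is why [0-9] is the safer choice when the captured value is going to be used as a number.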

j_random_hacker Over a year ago

@Chas: I did not know that about \d. (That must be what's causing my bugs -- all those MONGOLIAN DIGIT 5s out there... ;))

Nick Messick Over a year ago

One thing though: depending on the data, this seems to sort it in descending order or sometimes in ascending order. Any reason why?

Nick Messick Over a year ago

nevermind, I had data that didn't match the regexp in the second map
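The non-matching case mentioned here can be guarded against inside the transform. This is a sketch of one option rather than part of the original answer: lines that do not contain the 'sj.<digits>p' pattern are dropped in the first map instead of being given an undefined or stale sort key. The 'README.txt' entry is a made-up stray line for illustration.

```perl
use strict;
use warnings;

my @theArr = (
    '12 16 sj.1012804p1012831.93.gz',
    'README.txt',                        # hypothetical entry that does not match
    '12 16 sj.875352p875407.01.gz',
    '12 16 sj.875408p875435.02.gz',
);

my @sorted =
    map  { $_->[1] }                             # recover the original line
    sort { $a->[0] <=> $b->[0] }                 # numeric sort on the extracted key
    map  { /sj\.([0-9]+)p/ ? [ $1, $_ ] : () }   # skip lines without a key
    @theArr;

print "$_\n" for @sorted;
```

Whether to skip such lines, die on them, or sort them to one end depends on the data; skipping is just the shortest variant to show.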

Matt K · Accepted Answer · 2009-05-01 14:37:44Z

3

You can use a regex to pull the number out of every line inside the block you pass to the sort function:

@newArray = sort {
    # capture in list context so a failed match gives undef, not a stale $1
    my ($anum) = $a =~ /sj\.([0-9]+)p/;
    my ($bnum) = $b =~ /sj\.([0-9]+)p/;
    $anum <=> $bnum
} @theArr;

However, Chas. Owens's solution is better, since it only does the regex matches once for every element.
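To make the difference concrete, the sketch below counts how many times the number is extracted under each approach. The helper key_of and the two counters are hypothetical names introduced only for this illustration; the six entries are taken from the question's sample.

```perl
use strict;
use warnings;

my @theArr = (
    '12 16 sj.1012804p1012831.93.gz',
    '12 16 sj.875352p875407.01.gz',
    '12 16 sj.1012832p1012859.94.gz',
    '12 16 sj.875408p875435.02.gz',
    '12 16 sj.1012860p1012887.95.gz',
    '12 16 sj.875436p875535.03.gz',
);

my ($in_sort, $in_map) = (0, 0);

# Hypothetical helper: extract the sort key and bump the given counter.
sub key_of {
    my ($line, $counter) = @_;
    $$counter++;
    my ($n) = $line =~ /sj\.([0-9]+)p/;
    return $n;
}

# Regex inside the sort block: two extractions per comparison.
my @by_block = sort { key_of($a, \$in_sort) <=> key_of($b, \$in_sort) } @theArr;

# Schwartzian Transform: one extraction per element.
my @by_transform =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ key_of($_, \$in_map), $_ ] }
    @theArr;

print "extractions inside sort block: $in_sort\n";
print "extractions with transform:    $in_map\n";
```

Both approaches produce the same order. The transform's count equals the number of elements, while the sort-block count grows with the number of comparisons, so the gap widens as the array gets longer.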

edited May 1, 2009 at 14:37

answered May 1, 2009 at 1:11

Matt K


1 Comment

Chas. Owens Over a year ago

\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).

Plate · Accepted Answer · 2009-05-01 01:38:57Z

Here's an example that sorts them ascending, assuming you don't care too much about efficiency:

use strict;

my @theArr = split(/\n/, <<END_SAMPLE);
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
END_SAMPLE

my @sortedArr = sort compareBySJ @theArr;

print "Before:\n".join("\n", @theArr)."\n";
print "After:\n".join("\n", @sortedArr)."\n";

sub compareBySJ {
    # Capture the values to compare, against the expected format
    # NOTE: This could be inefficient for large, unsorted arrays
    #       since you'll be matching the same strings repeatedly
    my ($aVal) = $a =~ /^\d+\s+\d+\s+sj\.(\d+)p/
        or die "Couldn't match against value $a";
    my ($bVal) = $b =~ /^\d+\s+\d+\s+sj\.(\d+)p/
        or die "Couldn't match against value $b";

    # Return the numerical comparison of the values (ascending order)
    return $aVal <=> $bVal;
}

Outputs:

Before:
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
After:
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz

\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).

RBerteig · Accepted Answer · 2009-05-01 01:08:18Z

1

Yes. The sort function takes an optional comparison function which will be used to compare two elements. It can take the form of either a block of code, or the name of a function to call.

The perldoc entry for sort has an example that is similar to what you want to do:

# inefficiently sort by descending numeric compare using
# the first integer after the first = sign, or the
# whole record case-insensitively otherwise

@new = sort {
    ($b =~ /=(\d+)/)[0] <=> ($a =~ /=(\d+)/)[0]
                        ||
                uc($a) cmp uc($b)
} @old;
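The two forms of comparison mentioned at the top of this answer, an inline block and a named subroutine, look like this side by side. This is a minimal sketch on plain numbers; the subroutine name numerically and the array names are illustrative.

```perl
use strict;
use warnings;

my @nums = (10, 9, 100, 1);

# Form 1: an inline block; $a and $b are the two elements being compared.
my @with_block = sort { $a <=> $b } @nums;

# Form 2: the name of a subroutine that compares $a and $b.
sub numerically { $a <=> $b }
my @with_name = sort numerically @nums;

print "@with_block\n";
print "@with_name\n";
```

Both lines print the numbers in ascending numeric order. For the question's data, the body of the block or subroutine would extract the number after 'sj.' from $a and $b before comparing.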

answered May 1, 2009 at 1:08

RBerteig


11 Comments

Chas. Owens Over a year ago

\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).

j_random_hacker Over a year ago

+1, but Chas. Owens' solution is likely to be quite a bit faster as regex matching is only performed once.

Telemachus Over a year ago

So that's three (maybe four times) in one thread that we have heard about the dreaded 'MONGOLIAN DIGIT' problem. I'm genuinely curious: did you have a really bad case of Mongolian data flu at some point?

Chas. Owens Over a year ago

No, just trying to make sure people get the news to stop using \d (at least in Perl 5.8 and 5.10). And maybe if enough people find out, there will be enough pressure to get it fixed in 5.12. U+1815 is just a handy you-will-never-want-to-match-this character.

Chas. Owens Over a year ago

@Michael Carman - The problem is "knowing" your data is ASCII. We are increasingly moving into a world where UTF-8 is the default character encoding. Any code you write today that assumes it is working on ASCII will break tomorrow. As for matching any digit characters, there is always \p{N}, \p{Nd}, \p{Nl}, \p{No}, which are much better since they state explicitly what type of digit you are looking for. Until "\x{1815}" + 1 is 6, \d should mean [0-9] because people use \d to mean "numbers I can do math with".
