Perl split function - use repeating characters as delimiter

Question

I want to split a string using repeating letters as delimiter, for example, "123aaaa23a3" should be split as ('123', '23a3') while "123abc4" should be left unchanged.
So I tried this:

@s = split /([[:alpha:]])\1+/, '123aaaa23a3';

But this returns '123', 'a', '23a3', which is not what I wanted. I know this is because the last 'a' in 'aaaa' is captured by the parentheses and thus preserved by split(). However, I can't add something like ?: since [[:alpha:]] must be captured for the backreference. How can I resolve this?

I don't think you can modify the regex to avoid having a capturing group, but you can just throw away all of the odd-numbered elements of the list returned by split
– hobbs, Commented Sep 21, 2015 at 3:27
If the regex has capture groups, the returned list contains the matched/grouped substrings as well. You could use an alternative: my $str = '123aaaa23a3' =~ s/([[:alpha:]])\1+/~~/r; my @s = split /~~/, $str;
– hwnd, Commented Sep 21, 2015 at 3:44

Sobrique · Accepted Answer · 2015-09-21 10:13:07Z

Hmm, it's an interesting one. My first thought would be: the captured delimiters will always land at odd indices of the returned list, so you can just discard any odd-numbered array elements.

Something like this perhaps?:

use Data::Dumper;

# pad with '' so the list has an even number of elements
my %s = (split (/([[:alpha:]])\1+/, '123aaaa23a3'), '' );
print Dumper \%s;

This'll give you:

$VAR1 = {
          '23a3' => '',
          '123' => 'a'
        };

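So you can extract your fields via keys. Note that a hash does not preserve order and collapses duplicate fields, so this only suits cases where neither matters.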
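If the order of the fields matters, the same "discard the odd-numbered elements" idea can be applied to the array directly. This is only a minimal sketch, assuming the pattern has exactly one capture group so that fields and captured letters strictly alternate; the variable names @parts and @fields are just for illustration.

```perl
use strict;
use warnings;

my $str = '123aaaa23a3';

# With one capture group, split returns: field, captured letter, field, ...
my @parts = split /([[:alpha:]])\1+/, $str;

# Keep the even indices (0, 2, 4, ...), which hold the fields, in order.
my @fields = @parts[ grep { $_ % 2 == 0 } 0 .. $#parts ];

print join(',', @fields), "\n";    # prints 123,23a3
```

Unlike the hash version, this keeps duplicates and the original ordering.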

Unfortunately, my second approach of picking out the matched delimiters via %+ doesn't particularly help, because split doesn't seem to populate the regex capture variables.

But something like this:

my @delims = '123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
print Dumper \%+;

By using a named capture, we identify that a is from the capture group. Unfortunately, this doesn't seem to be populated when you do this via split - which might lead to a two-pass approach.

This is the closest I got:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my $str = '123aaaa23a3';

#build a regex out of '2-or-more' characters. 
my $regex = join ( "|", map { $_."{2,}"} $str =~ m/([[:alpha:]])\1+/g);
#make the regex non-capturing
$regex = qr/(?:$regex)/;
print "Using: $regex\n";

#split on the regex
my @s  = split m/$regex/, $str;

print Dumper \@s;

We first process the string to extract "2-or-more" character patterns to use as our delimiters. Then we assemble a regex out of them, using a non-capturing group, so we can split.
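One edge case worth guarding against: if the string contains no repeated letters (such as "123abc4" from the question), the joined alternation is empty, and splitting on a pattern that matches the empty string breaks the input into single characters. The following is a hedged sketch of the same two-pass idea with that guard added; split_on_repeats is a hypothetical helper name, not something from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: two-pass split on runs of a repeated letter.
sub split_on_repeats {
    my ($str) = @_;

    # Pass 1: collect each distinct letter that occurs in a run of 2 or more.
    my %seen;
    my @letters = grep { !$seen{$_}++ } $str =~ m/([[:alpha:]])\1+/g;

    # No repeated letters: return the string untouched instead of
    # splitting on an empty pattern.
    return ($str) unless @letters;

    # Pass 2: build a non-capturing alternation such as (?:a{2,}|b{2,}).
    my $alt = join '|', map { $_ . '{2,}' } @letters;
    return split /(?:$alt)/, $str;
}

print join(',', split_on_repeats('123aaaa23a3')), "\n";    # prints 123,23a3
print join(',', split_on_repeats('123abc4')), "\n";        # prints 123abc4
```

Because [[:alpha:]] only matches letters, the collected characters need no escaping before being placed in the regex.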

LeoNerd · Accepted Answer · 2015-09-21 10:31:55Z

One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that keeps the first of every pair of values in its input list:

use List::Util 1.29 qw( pairkeys );

my @vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';

Gives

Odd number of elements in pairkeys at (eval 6) line 1.
[ '123', '23a3' ]

That warning comes from the fact that pairkeys wants an even-sized list. We can solve that by adding one more value at the end:

my @vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;

Alternatively, and maybe a little neater, is to add that extra value at the start of the list and use pairvalues instead:

use List::Util 1.29 qw( pairvalues );

my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';

pjh · Accepted Answer · 2015-09-28 19:23:15Z

The 'split' can be made to work directly by using the delayed execution assertion (aka postponed regular subexpression), (??{ code }), in the regular expression:

@s = split /[[:alpha:]](??{"$&+"})/, '123aaaa23a3';

(??{ code }) is documented on the 'perlre' manual page.

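Note that, according to the 'perlvar' manual page, the use of $& anywhere in a program imposes a considerable performance penalty on all regular expression matches. (The same page says this applies to older perls; from Perl 5.20 onward, copy-on-write removes the penalty.) I've never found this to be a problem, but YMMV.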
