7

I want to split a string using repeating letters as delimiter, for example, "123aaaa23a3" should be split as ('123', '23a3') while "123abc4" should be left unchanged.
So I tried this:

@s = split /([[:alpha:]])\1+/, '123aaaa23a3';

But this returns '123', 'a', '23a3', which is not what I wanted. I know this happens because the last 'a' in 'aaaa' is captured by the parentheses and is therefore preserved by split(). But I can't add something like ?:, since [[:alpha:]] must be captured for the backreference. How can I resolve this?
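For reference, the behaviour described above can be reproduced with a few lines. This is only a minimal sketch; the variable name @parts is chosen for illustration.

```perl
use strict;
use warnings;

# A capturing group in the split pattern makes split return the captured
# text alongside the fields.
my @parts = split /([[:alpha:]])\1+/, '123aaaa23a3';

print join( ', ', map { "'$_'" } @parts ), "\n";   # '123', 'a', '23a3'
```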

2
  • 1
    I don't think you can modify the regex to avoid having a capturing group, but you can just throw away all of the odd-numbered elements of the list returned by split. Commented Sep 21, 2015 at 3:27
  • 2
    If the regex has capture groups, the returned list contains the matched/grouped substrings as well. You could use an alternative: my $str = '123aaaa23a3' =~ s/([[:alpha:]])\1+/~~/r; my @s = split /~~/, $str; Commented Sep 21, 2015 at 3:44
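The suggestion to discard the odd-numbered elements could look roughly like this. It is a sketch that assumes the pattern has exactly one capture group, so fields and captured delimiters strictly alternate; the input string and the names @parts and @fields are illustrative.

```perl
use strict;
use warnings;

# Two delimiter runs this time: 'bb' and 'aaaa'.
my @parts = split /([[:alpha:]])\1+/, '12bb3aaaa23a3';

# Fields sit at even indices (0, 2, 4, ...); captured delimiters at odd ones.
my @fields = @parts[ grep { $_ % 2 == 0 } 0 .. $#parts ];

print "@fields\n";   # 12 3 23a3
```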

3 Answers

4

Hmm, it's an interesting one. My first thought would be that the captured delimiters will always land at the odd-numbered positions, so you can just discard any odd-numbered array elements.

Something like this, perhaps:

my %s = (split (/([[:alpha:]])\1+/, '123aaaa23a3'), '' );
print Dumper \%s;

This'll give you:

$VAR1 = {
          '23a3' => '',
          '123' => 'a'
        };

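So you can extract the fields via keys, although a hash does not preserve their original order and collapses any duplicate fields.

Unfortunately, my second approach, 'selecting out' the pattern matches via %+, doesn't help much, because split doesn't seem to populate the regex capture variables.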
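One thing to keep in mind with the hash approach is that hash keys are unique, so a field that occurs twice is kept only once. A small sketch with an illustrative input string shows the effect:

```perl
use strict;
use warnings;

# '1' appears twice as a field, so the second pair overwrites the first.
my %s = ( split( /([[:alpha:]])\1+/, '1aa1bb2' ), '' );

# Sort only to get a stable print order; hash keys come back unordered.
my @fields = sort keys %s;
print scalar(@fields), " fields: @fields\n";   # 2 fields: 1 2
```

If order or duplicates matter, filtering the list by index is probably the safer route.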

But something like this:

my @delims = '123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
print Dumper \%+;

With a named capture, %+ shows that 'a' came from the delim group. Unfortunately, %+ doesn't seem to be populated when the pattern is used via split, which might lead to a two-pass approach.

This is the closest I got:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my $str = '123aaaa23a3';

# Collect each letter that occurs in a run of two or more.
my @letters = $str =~ m/([[:alpha:]])\1+/g;

# Build a non-capturing alternation such as (?:a{2,}|b{2,}).
my $regex = join '|', map { $_ . '{2,}' } @letters;
$regex = qr/(?:$regex)/;
print "Using: $regex\n";

# Split on the regex. With no repeated letters the pattern would be empty
# and split between every character, so keep the string whole in that case.
my @s = @letters ? split( $regex, $str ) : ($str);

print Dumper \@s;

We first process the string to extract the "2-or-more" character patterns to use as our delimiters. Then we assemble a non-capturing regex out of them, so that split returns only the fields.
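The two-pass idea can also be wrapped in a function, which makes it easy to check the case of a string with no repeated letters (an empty alternation would otherwise split between every character). This is a sketch only; split_on_runs is a hypothetical name, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: split $str on runs of a repeated letter.
sub split_on_runs {
    my ($str) = @_;

    # Pass 1: letters that occur in a run of two or more, de-duplicated.
    my %seen;
    my @letters = grep { !$seen{$_}++ } $str =~ /([[:alpha:]])\1+/g;

    # No runs found: return the string untouched.
    return ($str) unless @letters;

    # Pass 2: split on a non-capturing alternation such as (?:a{2,}|b{2,}).
    my $alternation = join '|', map { $_ . '{2,}' } @letters;
    return split /(?:$alternation)/, $str;
}

print join( '|', split_on_runs('123aaaa23a3') ), "\n";   # 123|23a3
print join( '|', split_on_runs('123abc4') ),     "\n";   # 123abc4
```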

2

One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that keeps the first of every pair of values in its input list:

use List::Util 1.29 qw( pairkeys );

my @vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';

Gives

Odd number of elements in pairkeys at (eval 6) line 1.
[ '123', '23a3' ]

That warning comes from the fact that pairkeys wants an even-sized list. We can solve that by adding one more value at the end:

my @vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;

Alternatively, and maybe a little neater, is to add that extra value at the start of the list and use pairvalues instead:

use List::Util 1.29 qw( pairvalues );

my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';
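As a quick check that the pairvalues form also copes with more than one delimiter run, here is a small sketch with an illustrative input string. A string that ends in a delimiter run may leave an odd-length list here, so that case could need separate handling.

```perl
use strict;
use warnings;
use List::Util 1.29 qw( pairvalues );

# The list becomes (undef, '12', 'b', '3', 'a', '23a3'); pairvalues keeps
# the second element of each pair, which are the fields.
my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '12bb3aaaa23a3';

print "@vals\n";   # 12 3 23a3
```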

0

The 'split' can be made to work directly by using the delayed execution assertion (aka postponed regular subexpression), (??{ code }), in the regular expression:

@s = split /[[:alpha:]](??{"$&+"})/, '123aaaa23a3';

(??{ code }) is documented on the 'perlre' manual page.

Note that, according to the 'perlvar' manual page, the use of $& anywhere in a program imposes a considerable performance penalty on all regular expression matches, although perlvar also says that this penalty no longer applies from Perl 5.20 onwards. I've never found this to be a problem, but YMMV.
