Perl regular expression removing duplicate consecutive substrings in a string

Question

I searched for this particular problem, but all I found was either removal of duplicate lines or removal of repeated strings that are separated by a delimiter.

My problem is slightly different. I have a string such as

    "comp name1 comp name2 comp name2 comp name3"

where I want to remove the repeated comp name2 and return only

    "comp name1 comp name2 comp name3"

They are not consecutive duplicate words, but consecutive duplicate substrings. Is there a way to solve this using regular expressions?

What if you have `"comp name1 comp name2 comp name2 comp name3 comp name4 comp name2"`? What will be the output?
– kurumi, Commented Apr 5, 2011 at 3:21
Hi @kurumi, I am interested in consecutive repeats only. So the second (or the third in the input) comp name2 will be left intact.
– Rasika, Commented Apr 5, 2011 at 3:27
Does it have to be regular expressions? String methods would be better for this IMHO.
– Justin Morgan, Commented Apr 5, 2011 at 3:42

btilly · Accepted Answer · 2011-04-05 03:21:05Z

8

    s/(.*)\1/$1/g

Be warned that the running time of this regular expression is quadratic in the length of the string.
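
One thing a single pass does not do is collapse runs of three or more copies: `(.*)\1` pairs copies up, so a run of three shrinks to two. A minimal sketch, assuming you want a run of any length reduced to one copy, is to repeat the substitution until it stops matching. The helper name `collapse_repeats` is made up for illustration, and `(.+)` replaces `(.*)` because an empty capture always matches, which would keep the loop running forever.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the substitution until nothing changes.
sub collapse_repeats {
    my ($s) = @_;
    # s///g returns the number of substitutions made; with (.+) that
    # count drops to 0 once no adjacent repeat is left, ending the loop.
    1 while $s =~ s/(.+)\1/$1/g;
    return $s;
}

my $in = "a comp name2 comp name2 comp name2 b";

# Single pass with the pattern from the answer: one duplicate survives.
(my $once = $in) =~ s/(.*)\1/$1/g;

print "once:   $once\n";                          # a comp name2 comp name2 b
print "looped: ", collapse_repeats($in), "\n";    # a comp name2 b
```

Like the one-liner it is built on, this still collapses doubled characters inside a word (`comm1` becomes `com1`), and each extra pass adds to the running time.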


3 Comments

Rasika Over a year ago

I am aware of the time complexity. In my case these are fairly short strings (~100 chars max) and would not take that long.

unkaitha Over a year ago

@btilly: How about the same problem with lines rather than strings? What if I have consecutive duplicate lines?

btilly Over a year ago

@unkaitha: `perl -ne 'print unless $seen{$_}++' file.txt > no_dupe_lines.txt`
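
Note that the `%seen` one-liner drops every line that has appeared anywhere earlier in the file, not only lines that repeat the one directly before them. If only consecutive duplicates should go (the behaviour of `uniq`), a rough sketch is to remember just the previous line. The helper name `drop_consecutive_dupes` is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop a line only when it equals the line directly before it.
sub drop_consecutive_dupes {
    my @lines = @_;
    my @out;
    my $prev;
    for my $line (@lines) {
        push @out, $line unless defined $prev && $line eq $prev;
        $prev = $line;
    }
    return @out;
}

# The later "a" is kept because it does not directly follow another "a".
print "$_\n" for drop_consecutive_dupes(qw(a a b a));

# Roughly the same idea as a one-liner:
#   perl -ne 'print unless defined $prev && $_ eq $prev; $prev = $_' file.txt
```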

Jonathan Leffler · Answer · 2011-04-05 03:24:42Z

3

This works for me (Mac OS X 10.6.7, Perl 5.13.4):

    use strict;
    use warnings;

    my $input  = "comp name1 comp name2 comp name2 comp name3";
    my $output = "comp name1 comp name2 comp name3";

    # Work on a copy so the input is still available for printing.
    my $result = $input;
    $result =~ s/(.*)\1/$1/g;

    print "In:   <<$input>>\n";
    print "Want: <<$output>>\n";
    print "Got:  <<$result>>\n";

The key point is the `\1` backreference in the pattern: it only matches a second copy of whatever group 1 has just captured.


1 Comment

Rasika Over a year ago

Slight variation of @btilly's solution. Thanks, but will have to go with the other as he was first.

Anonymous (edited by ollo) · Answer · 2013-03-05 17:25 (edited 17:41:24Z)

2

To avoid removing duplicate characters within a term (e.g. comm1 -> com1), bracket the `.*` in the regular expression with `\b` word-boundary anchors:

    s/(\b.*\b)\1/$1/g
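
One caveat: the `\b` anchors constrain the captured group but not the end of the repeated copy, so a phrase followed by a longer phrase sharing its prefix (`comp name2 comp name22`) can still be merged. Below is a sketch of a stricter variant, which is my own assumption and not taken from any of the answers: it requires copies to be whole whitespace-separated tokens and collapses runs of any length. The name `dedupe_phrases` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the answers above.
sub dedupe_phrases {
    my ($s) = @_;
    $s =~ s{
        (?<!\S)        # start at the beginning of a token
        (\S.*?)        # shortest candidate phrase, starting on a non-space
        (?:\s+\1)+     # one or more further copies, each after whitespace
        (?!\S)         # the last copy must end at the end of a token
    }{$1}gx;
    return $s;
}

my $tricky = "x comp name2 comp name22";
(my $with_b = $tricky) =~ s/(\b.*\b)\1/$1/g;

print "with \\b: $with_b\n";                       # x comp name22
print "strict:  ", dedupe_phrases($tricky), "\n";  # x comp name2 comp name22
print "sample:  ", dedupe_phrases("comp name1 comp name2 comp name2 comp name3"), "\n";
```

Because the separator is matched outside the capture, this also handles a string that consists of nothing but the repeated phrase. I have not measured its running time, and it is unlikely to be better than the simpler patterns.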


John Sobolewski (edited by Jonathan Leffler) · Answer · 2011-04-05 03:21 (edited 03:26:55Z)

1

I never work with languages that support this, but since you are using Perl, go here and see this section:

Useful Example: Checking for Doubled Words

When editing text, doubled words such as "the the" easily creep in. Using the regex `\b(\w+)\s+\1\b` in your text editor, you can easily find them. To delete the second word, simply type in `\1` as the replacement text and click the Replace button.
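
For reference, the quoted editor recipe translates to a Perl substitution roughly as follows. This is only a sketch; the helper name `drop_doubled_words` is made up. It also shows why the recipe does not answer the question as asked: `\w+` captures a single word, so a repeated two-word phrase is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the quoted regex in a Perl substitution.
sub drop_doubled_words {
    my ($s) = @_;
    # $1 plays the role of the editor's \1 replacement text.
    $s =~ s/\b(\w+)\s+\1\b/$1/g;
    return $s;
}

print drop_doubled_words("Paris in the the spring"), "\n";   # Paris in the spring

# The question's input has no doubled single word, so it comes back unchanged.
print drop_doubled_words("comp name1 comp name2 comp name2 comp name3"), "\n";
```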

4 Comments

btilly Over a year ago

Please, please, please. Don't call the language "pearl". It is "Perl" and the executable is "perl".

Jonathan Leffler Over a year ago

@btilly: fixed for him - I agree 100%. Also, the question is not about simple 'doubled words'; it is about 'doubled phrases' where the phrase might consist of more than one word. The answer you give can be extended to get to the required answer, but ...

Rasika Over a year ago

I found this in my searches, but it is only for repeated words and not strings. My substrings have word boundaries in them so this doesn't work.

Rasika Over a year ago

Yes, I should have used "double phrases" instead of substrings.

Alex Reynolds · Answer · 2011-04-05 03:36:46Z

If you need something running in linear time, you could split the string and iterate through the list:

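    #!/usr/bin/perl

    use strict;
    use warnings;

    my $str = "comp name1 comp name2 comp name2 comp name3";
    # split ' ' splits on runs of whitespace and ignores leading whitespace.
    my @elems = split ' ', $str;
    my $prevComp;    # name from the previous "comp nameN" pair
    my @out;
    # Tokens alternate: even index = label ("comp"), odd index = name.
    # Only the names are compared, so the labels are assumed to be identical.
    foreach my $elemIdx (0 .. $#elems) {
        next if $elemIdx % 2 == 0;    # a label is emitted together with its name
        # Keep the pair unless its name repeats the previous pair's name.
        if (!defined $prevComp || $prevComp ne $elems[$elemIdx]) {
            push @out, $elems[$elemIdx - 1], $elems[$elemIdx];
        }
        $prevComp = $elems[$elemIdx];
    }
    # A trailing label without a name (odd token count) is kept as-is.
    push @out, $elems[-1] if @elems % 2 == 1;
    print join(" ", @out), "\n";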

Dirty, perhaps, but should run faster.
