I tried to do a search on this particular problem, but all I get is either removal of duplicate lines or removal of repeated strings where they are separated by a delimiter.

My problem is slightly different. I have a string such as

    "comp name1 comp name2 comp name2 comp name3" 

where I want to remove the repeated comp name2 and return only

    "comp name1 comp name2 comp name3" 

They are not consecutive duplicate words, but consecutive duplicate substrings. Is there a way to solve this using regular expressions?

  • What if you have `"comp name1 comp name2 comp name2 comp name3 comp name4 comp name2"`? What will the output be? Commented Apr 5, 2011 at 3:21
  • Hi @kurumi, I am interested in consecutive repeats only, so the last comp name2 (the third one in the input) will stay intact. Commented Apr 5, 2011 at 3:27
  • Does it have to be regular expressions? String methods would be better for this IMHO. Commented Apr 5, 2011 at 3:42

5 Answers

    s/(.*)\1/$1/g

Be warned that the running time of this regular expression is quadratic in the length of the string.
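As a rough check against the case raised in the question's comments (a later, non-adjacent repeat should survive), here is a minimal sketch. The helper name `collapse_repeats` is made up for illustration; only the substitution itself comes from this answer.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): returns a copy of the input
# with immediately repeated substrings collapsed by the substitution above.
sub collapse_repeats {
    my ($text) = @_;
    $text =~ s/(.*)\1/$1/g;    # \1 must match right after the captured text
    return $text;
}

my $in = "comp name1 comp name2 comp name2 comp name3 comp name4 comp name2";
print collapse_repeats($in), "\n";
```

Because `\1` has to match directly after the captured text, only the adjacent pair of `comp name2` should be collapsed; the trailing `comp name2` is separated from it by other phrases and should be left alone.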


3 Comments

I am aware of the time complexity. In my case these are fairly short strings (~100 chars max) and would not take that long.
@btilly: How about the same problem with lines instead of strings, i.e. if I have consecutive duplicate lines?
@unkaitha: perl -ne 'print unless $seen{$_}++' file.txt > no_dupe_lines.txt

This works for me (Mac OS X 10.6.7, Perl 5.13.4):

    use strict;
    use warnings;

    my $input  = "comp name1 comp name2 comp name2 comp name3";
    my $output = "comp name1 comp name2 comp name3";

    my $result = $input;
    $result =~ s/(.*)\1/$1/g;

    print "In:   <<$input>>\n";
    print "Want: <<$output>>\n";
    print "Got:  <<$result>>\n";

The key point is the backreference '\1' in the pattern: it matches only a second copy of whatever the first capture group just matched.

1 Comment

Slight variation of @btilly's solution. Thanks, but will have to go with the other as he was first.

To avoid removing duplicate characters within the terms (e.g. comm1 -> com1), bracket the .* in the regular expression with \b so that the repeated text must start and end at a word boundary:

    s/(\b.*\b)\1/$1/g
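To make the difference concrete, here is a small sketch comparing the two substitutions. The helper names `dedupe_plain` and `dedupe_bounded` are invented for this example and the inputs are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helpers (names are assumptions, not from any library).
sub dedupe_plain {
    my ($text) = @_;
    $text =~ s/(.*)\1/$1/g;        # may also collapse doubled letters
    return $text;
}

sub dedupe_bounded {
    my ($text) = @_;
    $text =~ s/(\b.*\b)\1/$1/g;    # repeated text must sit on word boundaries
    return $text;
}

for my $in ("comm1", "comm name1 comm name2 comm name2 comm name3") {
    print "plain:   ", dedupe_plain($in),   "\n";
    print "bounded: ", dedupe_bounded($in), "\n";
}
```

With the plain pattern, the doubled `mm` in `comm1` counts as a repeated substring and is reduced to a single `m`. The bounded pattern cannot match there because there is no word boundary between the two letters, while the repeated phrase is still collapsed.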


I never work with languages that support this, but since you are using Perl, go here and see this section:

Useful Example: Checking for Doubled Words

When editing text, doubled words such as "the the" easily creep in. Using the regex \b(\w+)\s+\1\b in your text editor, you can easily find them. To delete the second word, simply type in \1 as the replacement text and click the Replace button.
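For comparison, here is a short Perl sketch of that doubled-word regex applied as a substitution. The helper name `drop_doubled_words` is an assumption made for this example. It handles a doubled single word, but it should leave the question's input untouched, since no single word there is immediately followed by itself.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): removes the second copy of
# an immediately doubled word, using the regex quoted above.
sub drop_doubled_words {
    my ($text) = @_;
    $text =~ s/\b(\w+)\s+\1\b/$1/g;
    return $text;
}

print drop_doubled_words("Paris in the the spring"), "\n";
print drop_doubled_words("comp name1 comp name2 comp name2 comp name3"), "\n";
```

The capture group `(\w+)` can only hold one word, so a repeated two-word phrase such as `comp name2 comp name2` never satisfies the backreference.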

4 Comments

Please, please, please. Don't call the language "pearl". It is "Perl" and the executable is "perl".
@btilly: fixed for him - I agree 100%. Also, the question is not about simple 'doubled words'; it is about 'doubled phrases' where the phrase might consist of more than one word. The answer you give can be extended to get to the required answer, but ...
I found this in my searches, but it is only for repeated words and not strings. My substrings have word boundaries in them so this doesn't work.
Yes, I should have used "double phrases" instead of substrings.

If you need something running in linear time, you could split the string and iterate through the list:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $str = "comp name1 comp name2 comp name2 comp name3";

    # Assumes every phrase is exactly two whitespace-separated words.
    my @words = split ' ', $str;
    my @kept;
    my $prev;

    # Take the words two at a time and compare each whole phrase with the
    # phrase immediately before it, so only consecutive repeats are dropped.
    while (my @pair = splice(@words, 0, 2)) {
        my $phrase = join ' ', @pair;
        push @kept, $phrase unless defined $prev && $phrase eq $prev;
        $prev = $phrase;
    }

    print join(' ', @kept), "\n";

Dirty, perhaps, but should run faster.
