Java regex to remove duplicate substrings from string

Question

I'm trying to build a regex to "reduce" duplicate consecutive substrings from a string in Java. For example, for the following input:

The big black dog big black dog is a friendly friendly dog who lives nearby nearby.

I'd like to get the following output:

The big black dog is a friendly dog who lives nearby.

This is the code I have so far:

String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

Pattern dupPattern = Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);
Matcher matcher = dupPattern.matcher(input);

while (matcher.find()) {
    input = input.replace(matcher.group(), matcher.group(1));
}

This works for all duplicate substrings except the one at the end of the sentence:

The big black dog is a friendly dog who lives nearby nearby.

I understand that my regex requires whitespace after each word in the substring, so it won't catch cases where a period follows the word instead of a space. I can't seem to find a workaround for this. I have tried playing around with the capture groups, and also changing the regex to look for whitespace or a period instead of just whitespace, but that only works if there is a period after each duplicate part of the substring ("nearby.nearby.").

Can somebody point me in the right direction? Ideally the inputs for this method will be short paragraphs and not just one-liners.
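
One possible way around the trailing-whitespace problem described above is to require the whitespace in front of each repeat rather than after each word, so the last repeat can be followed by a period. The sketch below is only an illustration: the class name `SeparatorFirstDemo`, the method name `collapse`, and the pattern itself are assumptions, not something taken from the answers. The nested quantifiers backtrack a fair amount, which should be acceptable for short paragraphs but has not been measured here.

```java
import java.util.regex.Pattern;

class SeparatorFirstDemo {
    // Hypothetical pattern: a run of one or more words (group 1), followed by
    // one or more repeats of that run, each introduced by whitespace.
    // The \b after \1 keeps "the theme" from being treated as "the the".
    private static final Pattern DUP = Pattern.compile(
            "\\b(\\w+(?:\\s+\\w+)*)(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Single pass: every run of consecutive repeats is replaced by one copy.
    static String collapse(String text) {
        return DUP.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // prints: The big black dog is a friendly dog who lives nearby.
        System.out.println(collapse(input));
    }
}
```

A single pass does not handle nested duplicates such as "big big black dog big black dog"; for those the replacement would have to be repeated until the string stops changing.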

Do you HAVE TO use a regex or are you just interested in an efficient solution?
– Jan B., Commented Jul 31, 2016 at 11:48
I don't have to use a regex actually, I just thought a regex could easily find duplicate phrases and not just duplicate words. Any other solution would also be welcome!
– ak_charlie, Commented Jul 31, 2016 at 11:55

Thomas Ayoub · Accepted Answer · 2016-07-31 12:13:15Z

3

You can use

input.replaceAll("([ \\w]+)\\1", "$1");

See live demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

        Pattern dupPattern = Pattern.compile("([ \\w]+)\\1", Pattern.CASE_INSENSITIVE);
        Matcher matcher = dupPattern.matcher(input);

        // Keep replacing while the current string still contains a duplicate.
        while (matcher.find()) {
            input = matcher.replaceAll("$1");
            // Re-create the matcher so the next find() sees the updated string.
            matcher = dupPattern.matcher(input);
        }
        System.out.println(input);
    }
}

edited Jul 31, 2016 at 12:13

answered Jul 31, 2016 at 11:52

Thomas Ayoub

5 Comments

Jan B. Over a year ago

That would not work for the following input "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby."

nicael Over a year ago

@Matt OP said nothing about conflicting duplications. Even if they did so, the same regex can be used to de-duplicate in this way - repeat replacing until the string won't have any matches anymore.

ak_charlie Over a year ago

Thank you Thomas, but there is an issue with word boundaries. For the input "This is my my dog", I would get "This my dog", wouldn't I?

Thomas Ayoub Over a year ago

@ak_charlie just change the regex to \\b([ \\w]+)\\1

ak_charlie Over a year ago

Thanks Thomas, was just about to comment that I added the word boundary :)
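
To make the word-boundary point concrete, here is a small sketch comparing the two regexes from this comment thread on the input ak_charlie mentions. The class name `BoundaryDemo` and the helper `strip` are made up for illustration. The third pattern, with an extra \b after the backreference, is not from the thread; it is an assumption added here because the leading \b alone still seems to collapse repeats inside a single word such as "cocoa".

```java
class BoundaryDemo {
    // Hypothetical helper: apply one replacement pass with the given regex.
    static String strip(String regex, String text) {
        return text.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        // No boundary: the "is" inside "This" pairs up with the following "is".
        System.out.println(strip("([ \\w]+)\\1", "This is my my dog"));     // This my dog
        // Leading \b: the match has to start at a word boundary.
        System.out.println(strip("\\b([ \\w]+)\\1", "This is my my dog"));  // This is my dog
        // Leading \b alone still folds a repeat inside one word.
        System.out.println(strip("\\b([ \\w]+)\\1", "hot cocoa"));          // hot coa
        // Assumption: a trailing \b as well leaves such words alone.
        System.out.println(strip("\\b([ \\w]+)\\1\\b", "hot cocoa"));       // hot cocoa
    }
}
```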

Eugene · Accepted Answer · 2016-07-31 17:29:00Z

2

Combine @Thomas Ayoub's answer with @Matt's comment: rerun the replacement until the string stops changing.

public class Test2 {
    public static void main(String args[]){
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        String result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        while(!input.equals(result)){
            input = result;
            result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        }
        System.out.println(result);
    }
}

edited Jul 31, 2016 at 17:29

answered Jul 31, 2016 at 12:12

Eugene

2 Comments

Thomas Ayoub Over a year ago

Why do you introduce result?

Eugene Over a year ago

@ThomasAyoub Hmmm, maybe for better readability. What's your opinion?
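
On the question of the extra variable: the loop has to remember the previous string somehow in order to notice that nothing changed, but a do/while inside a helper method can keep that detail out of the calling code. A minimal sketch, where the names `FixedPointDemo` and `dedupe` are hypothetical and the regex is the one used in the answer above:

```java
class FixedPointDemo {
    // Repeat the replacement until a pass leaves the string unchanged.
    // Each successful pass shortens the string, so the loop terminates.
    static String dedupe(String text) {
        String previous;
        do {
            previous = text;
            text = text.replaceAll("\\b([ \\w]+)\\1", "$1");
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // prints: The big black dog is a friendly dog who lives nearby.
        System.out.println(dedupe(input));
    }
}
```

With this input the first pass only collapses "big big", "friendly friendly" and "nearby nearby"; the repeated "big black dog" becomes adjacent afterwards and is removed on the second pass.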
