I'm trying to build a regex to "reduce" duplicate consecutive substrings from a string in Java. For example, for the following input:

The big black dog big black dog is a friendly friendly dog who lives nearby nearby.

I'd like to get the following output:

The big black dog is a friendly dog who lives nearby.

This is the code I have so far:

String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

Pattern dupPattern = Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);
Matcher matcher = dupPattern.matcher(input);

while (matcher.find()) {
    input = input.replace(matcher.group(), matcher.group(1));
}

This works for every duplicate substring except the one at the end of the sentence:

The big black dog is a friendly dog who lives nearby nearby.

I understand that my regex requires whitespace after each word in the substring, so it won't catch a duplicate that is followed by a period instead of a space. I can't find a workaround for this. I have tried rearranging the capture groups, and also changing the regex to accept either whitespace or a period instead of just whitespace, but that only works if a period follows every repetition of the duplicate ("nearby.nearby.").
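
To make the limitation concrete, here is a small check of that same pattern against a repeat followed by a space and a repeat followed by a period. The class name `TrailingDuplicateCheck` and the helper `findsDuplicate` are made up for this illustration only.

```java
import java.util.regex.Pattern;

class TrailingDuplicateCheck {
    // The pattern from the question: every word must be followed by one whitespace character.
    static final Pattern DUP =
        Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the pattern finds a repeated phrase anywhere in the text.
    static boolean findsDuplicate(String text) {
        return DUP.matcher(text).find();
    }

    public static void main(String[] args) {
        // Repeat followed by a space: the second "nearby " satisfies the backreference.
        System.out.println(findsDuplicate("lives nearby nearby and")); // true
        // Repeat followed by a period: "nearby." is not equal to "nearby ", so nothing matches.
        System.out.println(findsDuplicate("lives nearby nearby."));    // false
    }
}
```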

Can somebody point me in the right direction? Ideally the inputs for this method will be short paragraphs and not just one-liners.

  • Do you HAVE TO use a regex or are you just interested in an efficient solution? Commented Jul 31, 2016 at 11:48
  • I don't have to use a regex actually, I just thought a regex could easily find duplicate phrases and not just duplicate words. Any other solution would also be welcome! Commented Jul 31, 2016 at 11:55

2 Answers

3

You can use

input = input.replaceAll("([ \\w]+)\\1", "$1");

See live demo:

import java.util.regex.Pattern;

class Ideone
{
    public static void main (String[] args)
    {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

        // Group 1 is any run of spaces and word characters that is immediately repeated.
        Pattern dupPattern = Pattern.compile("([ \\w]+)\\1", Pattern.CASE_INSENSITIVE);

        // Strings are immutable, so assign the result back. A single replaceAll
        // already handles every non-overlapping duplicate in the string.
        input = dupPattern.matcher(input).replaceAll("$1");
        System.out.println(input);
    }
}

5 Comments

That would not work for the following input "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby."
@Matt OP said nothing about conflicting duplications. Even if they did so, the same regex can be used to de-duplicate in this way - repeat replacing until the string won't have any matches anymore.
Thank you Thomas, but there is an issue with word boundaries. For the input "This is my my dog" I would get "This my dog", wouldn't I?
@ak_charlie just change the regex to \\b([ \\w]+)\\1
Thanks Thomas, was just about to comment that I added the word boundary :)
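
As a quick illustration of the word-boundary point raised in these comments, the sketch below runs both regexes on the same input. The class and method names (`WordBoundaryDemo`, `withoutBoundary`, `withBoundary`) are hypothetical and exist only for this comparison.

```java
class WordBoundaryDemo {
    // Regex from the answer: the repeated run may start in the middle of a word.
    static String withoutBoundary(String s) {
        return s.replaceAll("([ \\w]+)\\1", "$1");
    }

    // Regex from the follow-up comment: the repeated run must start at a word boundary.
    static String withBoundary(String s) {
        return s.replaceAll("\\b([ \\w]+)\\1", "$1");
    }

    public static void main(String[] args) {
        String input = "This is my my dog";
        // "is " inside "This " is followed by "is ", so it counts as a duplicate here.
        System.out.println(withoutBoundary(input)); // This my dog
        // With \b the match can no longer start inside "This".
        System.out.println(withBoundary(input));    // This is my dog
    }
}
```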
2

Combine @Thomas Ayoub's answer with @Matt's comment: repeat the replacement until the string stops changing.

public class Test2 {
    public static void main(String[] args) {
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // First pass: collapse each immediately repeated run of words.
        String result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        // Repeat until a pass changes nothing, so nested duplicates are removed too.
        while (!input.equals(result)) {
            input = result;
            result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        }
        System.out.println(result);
    }
}
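
For reuse on short paragraphs, the same idea could be wrapped in a method with a precompiled pattern. The sketch below is one possible variation, not the regex used above: it matches whole words separated by any whitespace and requires a word boundary after the repeat, so a word is not mistaken for the prefix of a longer one. The names `DuplicatePhrases`, `REPEATED_PHRASE` and `collapse` are made up for this example, and the pattern backtracks heavily, so it is only meant for short inputs.

```java
import java.util.regex.Pattern;

class DuplicatePhrases {
    // Group 1: one or more whole words separated by whitespace.
    // It must be followed by one or more repetitions of itself, each ending on a word boundary.
    private static final Pattern REPEATED_PHRASE =
        Pattern.compile("\\b(\\w+(?:\\s+\\w+)*)(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Replaces every repeated phrase with a single copy and repeats the pass
    // until nothing changes, so nested duplicates are collapsed as well.
    static String collapse(String text) {
        String previous;
        do {
            previous = text;
            text = REPEATED_PHRASE.matcher(text).replaceAll("$1");
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        System.out.println(collapse(input));
        // The big black dog is a friendly dog who lives nearby.
    }
}
```

Because the pattern separates words with \s+ rather than a literal space, a phrase repeated across a line break is collapsed as well.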

2 Comments

Why do you introduce result?
@ThomasAyoub Hmmm, maybe for better readability. What's your opinion?
