I am working on a problem that requires removing duplicated words from a string. For example:

Input: Goodbye bye bye world world world

Output: Goodbye bye world

I found a working pattern online, but I do not understand every part of it.

    String pattern = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";

Here is my understanding:

  1. the initial \\b matches a word boundary
  2. (\\w+) matches one or more word characters
  3. in this expression: (\\b\\W+\\b\\1\\b)*

    a. \\b matches a word boundary

    b. \\W+ matches one or more non-word characters

    c. \\b again matches a word boundary

    d. \\1 ??? I don't know what this is for, but the pattern won't work without it

    e. \\b again matches a word boundary

As you can see, my main confusion is about item 3, especially \\1. Can anyone explain it more clearly?
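
To make item 3d concrete: `\\1` is a back-reference to whatever group 1, `(\\w+)`, has just captured, so `(\\b\\W+\\b\\1\\b)*` matches zero or more further copies of that same word, each preceded by non-word characters. Below is a minimal sketch, assuming the whole match is replaced with `$1`; the replacement string, the class name `ConsecutiveDuplicates` and the method name `collapse` are illustrative assumptions, not part of the original post.

```java
public class ConsecutiveDuplicates {
    // Pattern copied from the question. Group 1 captures a word; the starred
    // group then consumes any immediately following repeats of that word.
    static final String PATTERN = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";

    // Assumption: each match (a word plus its consecutive repeats) is
    // replaced by group 1 alone, i.e. a single copy of the word.
    static String collapse(String input) {
        return input.replaceAll(PATTERN, "$1");
    }

    public static void main(String[] args) {
        // prints: Goodbye bye world
        System.out.println(collapse("Goodbye bye bye world world world"));
    }
}
```

Without `\\1` the starred group would have no way of requiring that the next word equals the first one, which is probably why the pattern stops working when it is removed. Note that this pattern only collapses consecutive repeats.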

  • Hi. I always use regexr to test and try regular expressions. If you hover the pointer over part of an expression, it shows a message explaining what is going on. Commented Jan 23, 2017 at 19:20
  • @GabrielMarques, thanks for the link. However, neither my pattern nor the one written by anubhava works in this web editor. Is the syntax the same as Java regex? Commented Jan 23, 2017 at 19:33
  • Yes. Try replacing each double backslash with a single backslash '\' and it will work. You use double backslashes because you are writing the expression inside a Java string, where the backslash itself has to be escaped. Commented Jan 23, 2017 at 19:52
  • @anubhava Yes, thank you! Commented Jan 14, 2019 at 16:29

1 Answer

In Java, you can use a lookahead with a back-reference to remove every word that occurs again later in the string:

    final String regex = "\\b(\\w+)\\b\\s*(?=.*\\b\\1\\b)";
    final String input = "Goodbye bye bye world world world\n";

    final String result = input.replaceAll(regex, "");

It is important to use word boundaries here to avoid matching partial words.
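
As a rough illustration of the behaviour described above, here is a small self-contained sketch that wraps the same regex in a method and runs it on a few inputs. The class name `RemoveEarlierDuplicates` and the method name are hypothetical; only the regex comes from this answer.

```java
public class RemoveEarlierDuplicates {
    // Regex copied from the answer: match a whole word (plus trailing
    // whitespace) only if the same whole word appears again later on the line.
    static final String REGEX = "\\b(\\w+)\\b\\s*(?=.*\\b\\1\\b)";

    static String removeEarlierDuplicates(String input) {
        return input.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        // prints: Goodbye bye world
        System.out.println(removeEarlierDuplicates("Goodbye bye bye world world world"));

        // "bye" is not removed here, because \b stops \1 from matching
        // inside "Goodbye". prints: bye Goodbye
        System.out.println(removeEarlierDuplicates("bye Goodbye"));

        // Non-consecutive repeats are removed too, but the LAST copy is kept.
        // prints: work and I sleep
        System.out.println(removeEarlierDuplicates("I work and I sleep"));
    }
}
```

Because the lookahead uses `.`, which does not match line terminators by default, repeats are only detected within the same line.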

5 Comments

It doesn't work if my sentence is 'I work and I sleep' and I expect the output to be 'I work and sleep'.
It seems we can't achieve this with a regex approach if the repeated words are non-consecutive, and we need to write a program instead. What do you say?
Actually, I just realized this regex removes non-consecutive repeats as well, but it can only remove the repeats that are not the last one.
@anubhava In other words, this does not work and is wrong when the number of matches is odd.
It does work, in the sense that it removes all but the last occurrence of each repeated word.
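
For the case raised in these comments, where the first occurrence should be kept rather than the last, one possible non-regex approach is sketched below. It is not from the answer: it assumes words are separated by whitespace, compares them case-sensitively, and collapses every run of whitespace into a single space. The class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class KeepFirstOccurrence {
    // Assumption: words are whitespace-separated tokens, compared exactly.
    static String keepFirstOccurrence(String input) {
        // LinkedHashSet ignores repeated additions and preserves the order
        // in which each distinct word was first seen.
        Set<String> seen = new LinkedHashSet<>();
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                seen.add(word);
            }
        }
        return String.join(" ", seen);
    }

    public static void main(String[] args) {
        // prints: I work and sleep
        System.out.println(keepFirstOccurrence("I work and I sleep"));
    }
}
```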
