
I've been playing with this regex in Java for ages and can't get it to work:

(?:^| )(?:the|and|at|in|or|on|off|all|beside|under|over|next)(?: |$)

The following:

pattern.matcher("the cat in the hat").replaceAll(" ")

gives me " cat the hat", with the second "the" left in place. Another example input is "the cat in of the next hat", which gives me " cat of next hat".

Is there any way I can make this regex replacement work without having to break them out into multiple separate regexes for each word and try to replace a string repeatedly?


2 Answers


Yes, you can do this pretty easily. You just need to use word boundaries (\b), which is what you're trying to describe with (?:^| ) and (?: |$). Just do this instead:

\b(?:the|and|at|in|or|on|off|all|beside|under|over|next)\b

Your original didn't capture, but as mentioned in the comments, if you want to capture the matched word you can use a capturing group instead of a non-capturing one:

\b(the|and|at|in|or|on|off|all|beside|under|over|next)\b
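
As a rough sketch of how the word-boundary version could be used from Java: the class name WordBoundaryDemo and the helper strip are made up for illustration, and the whitespace clean-up step is an assumption about the desired output, not something the question asked for. Note that \b also matches next to punctuation, so this is slightly broader than the space-only original.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // \b matches between a word character and a non-word character
    // (or the string edge) without consuming anything.
    static final Pattern STOP_WORDS = Pattern.compile(
            "\\b(?:the|and|at|in|or|on|off|all|beside|under|over|next)\\b");

    // Hypothetical helper: removes the listed words, then collapses the
    // leftover runs of whitespace (assumed clean-up, adjust as needed).
    static String strip(String input) {
        String withoutWords = STOP_WORDS.matcher(input).replaceAll("");
        return withoutWords.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("the cat in the hat"));
        System.out.println(strip("the cat in of the next hat"));
    }
}
```

With the two inputs from the question this prints "cat hat" and "cat of hat"; "at" inside "cat" and "hat" is left alone because there is no boundary in the middle of a word.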

9 Comments

You might also need match groups: (\b(?:the|and|at|in|or|on|off|all|beside|under|over|next)\b)
@frhd The best solution would then be to simply replace the non-capturing group by a capturing one: \b(the|and|at|in|or|on|off|all|beside|under|over|next)\b
@sp00m yep, this answer should be updated with your fix.
I don't need to know what was removed, so I can leave the capturing group out. It works, thanks, but I don't really understand why the original one doesn't work.
@frhd Well, it all depends on whether the OP needs to capture the data or not ;)

The problem with yours is that the leading and trailing spaces are included in the matches, and a character cannot be part of two matches.

So with the input the_cat_in_the_hat (the underscores replace the spaces here, to make the explanation clearer):

  1. First match: the_, remaining string: cat_in_the_hat
  2. Second match: _in_, remaining string: the_hat
  3. the is not matched, since it is preceded neither by a space (that space was already consumed by the second match) nor by the beginning of the (original) string.
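
The steps above can be reproduced with a small sketch that lists each match of the original pattern together with its start index. The class name ConsumedSpaceDemo and the helper matches are hypothetical, chosen only for this illustration; the brackets in the output are there to make the consumed spaces visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedSpaceDemo {
    // The pattern from the question: the surrounding spaces are part of the match.
    static final Pattern ORIGINAL = Pattern.compile(
            "(?:^| )(?:the|and|at|in|or|on|off|all|beside|under|over|next)(?: |$)");

    // Hypothetical helper: records every match as "start:[text]".
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            found.add(m.start() + ":[" + m.group() + "]");
        }
        return found;
    }

    public static void main(String[] args) {
        // Only two matches are found; the second "the" is skipped.
        System.out.println(matches("the cat in the hat"));
        System.out.println("[" + ORIGINAL.matcher("the cat in the hat").replaceAll(" ") + "]");
    }
}
```

The second match ends at index 11, right where the second "the" begins, so the next search starts on the letter t with no space left in front of it.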

You could have used lookarounds instead, since they behave like conditions (i.e. an if): they check the surrounding characters without including them in the match:

(?<=^| )(?:the|and|at|in|or|on|off|all|beside|under|over|next)(?= |$)


This way, you would have:

  1. First match: the, remaining string: _cat_in_the_hat
  2. Second match: in, remaining string: _the_hat
  3. Third match: the, remaining string: _hat
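
A minimal sketch of the lookaround version in Java, again listing each match with its start index. The class name LookaroundDemo and the helper matches are assumptions made for this example, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Lookbehind and lookahead assert the neighbouring space (or string edge)
    // without consuming it, so adjacent words can each be matched.
    static final Pattern LOOKAROUND = Pattern.compile(
            "(?<=^| )(?:the|and|at|in|or|on|off|all|beside|under|over|next)(?= |$)");

    // Hypothetical helper: records every match as "start:text".
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = LOOKAROUND.matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches("the cat in the hat"));
    }
}
```

Because the spaces stay in the string, removing the matches leaves runs of spaces behind, which may need collapsing afterwards depending on the output you want.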

But @JonathanMee's answer is the best solution, since word boundaries were implemented precisely for this purpose ;)

1 Comment

This is an excellent description of the problem, I prefer my final solution, but +1 because this makes a better answer.
