https://regex101.com/r/kBxa7R/2

I have the following regex: \b(\w+) \b(?=.*\b\1)

I need to remove all duplicate words in a string. For instance:

Mike Tyson 1. Street 1234 Vietnam ML(12534/97632) Mike Tyson 1234 1. Street Vietnam ML(12534/97632)

should result in:

Mike Tyson 1. Street 1234 Vietnam ML(12534/97632)

I already know why it fails, but I do not know how to fix it. I only match \w+, so tokens such as "1." or "ML(156746/615893)" are not found as whole words. But when I add the missing characters manually, or replace the whole group with .+, I get weird results.
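
To illustrate the tokenization problem described above, here is a small sketch. Python is an assumption (the question does not name a language), and the `fragment` string is just a shortened piece of the sample input:

```python
import re

fragment = "1. ML(12534/97632)"

# \w+ matches only letters, digits and underscores,
# so punctuation splits each token into pieces.
word_tokens = re.findall(r"\w+", fragment)

# \S+ matches any run of non-whitespace characters,
# so each space-separated token stays whole.
nonspace_tokens = re.findall(r"\S+", fragment)

print(word_tokens)      # ['1', 'ML', '12534', '97632']
print(nonspace_tokens)  # ['1.', 'ML(12534/97632)']
```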

Can someone help?

1 Answer

You may use this regex:

(?<!\S)(\S+)\h+(?=(?:\S+\h+)*?\1(?!\S))

Updated RegEx Demo

RegEx Details:

  • (?<!\S): Negative lookbehind to assert that the previous character is not a non-whitespace character (i.e. we are at the start of a token)
  • (\S+): Match 1+ non-whitespace characters and capture them in group #1
  • \h+: Match 1+ horizontal whitespace characters
  • (?=: Start positive lookahead
    • (?:\S+\h+)*?: Lazily match 0 or more groups, where each group consists of 1+ non-whitespace characters followed by 1+ horizontal whitespace characters
    • \1: Back-reference to group #1
    • (?!\S): Must not be followed by a non-whitespace character, to avoid partial matches
  • ): End positive lookahead
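
To try the pattern outside regex101, here is a minimal Python sketch. Python is an assumption, since the question does not name a language. Python's built-in `re` module does not support `\h`, so the sketch substitutes `[ \t]` for it, assuming only spaces and tabs separate the tokens. The function name `remove_earlier_duplicates` is made up for illustration.

```python
import re

# Adaptation of the pattern above for Python's `re` module:
# `\h` is not supported there, so `[ \t]` stands in for it
# (assumption: only spaces and tabs separate the tokens).
DUPLICATE_TOKEN = re.compile(
    r"(?<!\S)(\S+)[ \t]+(?=(?:\S+[ \t]+)*?\1(?!\S))"
)


def remove_earlier_duplicates(text: str) -> str:
    """Remove every token that occurs again later as a whole token."""
    return DUPLICATE_TOKEN.sub("", text)


sample = (
    "Mike Tyson 1. Street 1234 Vietnam ML(12534/97632) "
    "Mike Tyson 1234 1. Street Vietnam ML(12534/97632)"
)
result = remove_earlier_duplicates(sample)
print(result)  # Mike Tyson 1234 1. Street Vietnam ML(12534/97632)
```

Note that the pattern removes a token whenever the same token occurs again later, so the last occurrence of each token is the one that survives. For the sample input, the remaining tokens therefore follow the order of the second half of the string, not the first.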

Casimir made a very good suggestion in the comments: use the backtracking control verb (*SKIP) in PCRE flavors. According to regex101, this appears to be more efficient:

~(\S+) \h+ (*SKIP) (?= (?>\S+\h+)*? \1 (?!\S) )~x

5 Comments

I have no clue why it works, but it kinda does. Thank you in advance. Could you please have a look at regex101.com/r/kBxa7R/6 again? There are just two minor issues left. 1. In the street, the "." is replaced even when the number is not the same (see line 2 in regex101). 2. For the XX(XXXX/XXXXX) part, it should only be replaced if the whole string is the same (see lines 2 and 3 in regex101).
Just out of curiosity: can't we just use \b(\S+) instead of (?<!\S)(\S+)?
Wow, both of them are really good ones @CasimiretHippolyte
@Vincent: Please try (?<!\S)(\S+)\h+(?=(?:\S+\h+)*?\1(?!\S)) as per Casimir's suggestion. And yes, we can also use \b instead of (?<!\S), but I think \b might be a tad slower.
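
One caveat on the \b variant, sketched in Python (an assumption, as the thread does not name a language; the `text` string is an invented example): \b only matches between a word character and a non-word character, so the two prefixes behave differently for tokens that begin with punctuation.

```python
import re

text = "(abc) x"

# \b cannot match between the start of the string and "(",
# so the first match starts one character too late.
with_word_boundary = re.findall(r"\b(\S+)", text)

# (?<!\S) only requires that no non-whitespace character precedes,
# so the whole token is captured.
with_lookbehind = re.findall(r"(?<!\S)(\S+)", text)

print(with_word_boundary)  # ['abc)', 'x']
print(with_lookbehind)     # ['(abc)', 'x']
```

For input like the sample in the question, where every token starts with a letter or digit, the two forms should pick the same token starts.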
