1

How would I remove adjacent duplicate words in a string? For example: 'Hey there There' -> 'Hey there'

  • stackoverflow.com/questions/7794208/… if you want no duplicate words at all... Or do you only want to remove adjacent duplicates? Commented Jul 22, 2021 at 7:57
  • These words are not adjacent though Commented Jul 22, 2021 at 7:58

4 Answers

10

Using re.sub with a backreference we can try:

import re

inp = 'Hey there There'
output = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

The regex pattern used here says to:

(\w+)  match and capture a word
[ ]    followed by a single space (written as a literal space in the pattern above)
\1     then followed by the same text again (case is ignored because of re.IGNORECASE)

Then we replace the whole match with just the first captured word.

6 Comments

What does r mean above?
@user1655130 An r preceding a Python string marks it as a raw string, in which Python leaves backslashes as they are. We use raw strings because they make regex easier to write: sequences such as \w and \1 need no extra escaping.
From a learning perspective, how would you do this with recursion?
I suggest opening a new question, as using some kind of recursive approach is very different from my current answer (but maybe I can post another answer).
Unfortunately, it won't let me ask a similar question. Thanks for your help.
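
To make the raw-string comment above concrete, here is a small sketch. The variable names raw and plain are only for illustration and do not come from the answer.

```python
import re

# A raw string keeps the backslash: two characters, '\' and '1'.
raw = r'\1'
# Without the r prefix, Python reads '\1' as an octal escape: one character, chr(1).
plain = '\1'

print(len(raw), len(plain))  # 2 1

inp = 'Hey there There'
# With the raw string, \1 reaches the regex engine as a backreference.
print(re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE))  # Hey there
# Without it, the pattern looks for a literal chr(1), so nothing is replaced.
print(re.sub('(\\w+) \1', r'\1', inp, flags=re.IGNORECASE))  # Hey there There
```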
3
import re

inp = 'Hey there There'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

inp = 'Hey there eating?'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there eating?

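\b asserts a word boundary, so the pattern only matches whole words rather than fragments of a word. The second test case ("Hey there eating?") does not work with the answer given by Tim Biegeleisen (https://stackoverflow.com/a/68481181/8439676): without the boundaries, the final "e" of "there" and the first "e" of "eating" are treated as a duplicate, which gives "Hey thereating?".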
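
The difference can be checked side by side. This is a minimal sketch; the helper dedupe and the two pattern names are made up for the comparison and are not part of either answer.

```python
import re

def dedupe(text, pattern):
    """Replace each match of pattern with its first captured group."""
    return re.sub(pattern, r'\1', text, flags=re.IGNORECASE)

without_boundaries = r'(\w+) \1'    # pattern from the first answer
with_boundaries = r'\b(\w+) \1\b'   # pattern from this answer

text = 'Hey there eating?'
print(dedupe(text, without_boundaries))  # Hey thereating?
print(dedupe(text, with_boundaries))     # Hey there eating?

# Both patterns agree on the original example.
print(dedupe('Hey there There', with_boundaries))  # Hey there
```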

0

Remove adjacent duplicate words recursively

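    def remove_consecutive_duplicate_words(s):
        words = s.split()
        # Base case: zero or one word left, so nothing to compare.
        if len(words) < 2:
            return " ".join(words)
        # Compare case-insensitively so that 'there There' counts as a duplicate.
        if words[0].lower() != words[1].lower():
            return words[0] + " " + remove_consecutive_duplicate_words(" ".join(words[1:]))
        # Duplicate found: drop the second copy and keep the first occurrence.
        return remove_consecutive_duplicate_words(" ".join([words[0]] + words[2:]))


    string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
    print(remove_consecutive_duplicate_words(string))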

Output: I am a duplicate word in a sentence. How I can be removed?
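
The recursive function makes one call per word, so a very long string can hit Python's recursion limit (1000 by default in CPython). As a possible alternative, not the approach used in this answer, the same split-and-compare idea can be sketched without recursion using itertools.groupby. The function name remove_adjacent_duplicates is an assumption for illustration, and the comparison here ignores case to match the example in the question.

```python
from itertools import groupby

def remove_adjacent_duplicates(text):
    """Collapse runs of adjacent equal words (ignoring case), keeping the first of each run."""
    words = text.split()
    # groupby collects consecutive words that share the same lowercase key;
    # next(group) yields the first word of each run as originally written.
    return " ".join(next(group) for _, group in groupby(words, key=str.lower))

print(remove_adjacent_duplicates('Hey there There'))  # Hey there
print(remove_adjacent_duplicates(
    'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
))  # I am a duplicate word in a sentence. How I can be removed?
```

Like the recursive version, this splits on whitespace only, so a word with trailing punctuation such as "sentence." is not considered equal to "sentence".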

0

Rohit Sharma's answer should be accepted, as it does in fact take word boundaries into account. The original answer would incorrectly change "Hey there eating" to "Hey thereating".

Alternatively, one could use the following regex (which will produce a slightly different output in some scenarios; see examples below):

my_output = re.sub(r'\b(\w+)(?:\W+\1\b)+', r'\1', my_input, flags=re.IGNORECASE)

Example 1:

INPUT: Buying food food in the supermarket

ROHIT'S VERSION OUTPUT: Buying food in the supermarket

ABOVE VERSION OUTPUT: Buying food in the supermarket

Example 2:

INPUT: Food: Food and Beverages

ROHIT'S VERSION OUTPUT: Food: Food and Beverages (unchanged)

ABOVE VERSION OUTPUT: Food and Beverages
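
The two examples can be reproduced with a short script. This is only a sketch: the names ROHIT_PATTERN, ALT_PATTERN and dedupe are chosen here for illustration.

```python
import re

ROHIT_PATTERN = r'\b(\w+) \1\b'          # duplicates separated by exactly one space
ALT_PATTERN = r'\b(\w+)(?:\W+\1\b)+'     # duplicates separated by any non-word characters

def dedupe(text, pattern):
    """Replace each match of pattern with its first captured group."""
    return re.sub(pattern, r'\1', text, flags=re.IGNORECASE)

for text in ('Buying food food in the supermarket', 'Food: Food and Beverages'):
    print('INPUT:  ', text)
    print('ROHIT:  ', dedupe(text, ROHIT_PATTERN))
    print('ABOVE:  ', dedupe(text, ALT_PATTERN))
```

The outputs differ only in Example 2, where the duplicates are separated by ": " rather than a single space.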

Explanation:

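“\b”: A word boundary. Boundaries are needed so that only whole words are compared. For example, in “My thesis is great”, the “is” at the end of “thesis” is not treated as a duplicate of the following “is”.

“\w+”: One or more word characters. With re.ASCII this is [a-zA-Z0-9_]; by default, Python 3 string patterns also match Unicode letters and digits.

“\W+”: One or more non-word characters: [^\w]

“\1”: Matches whatever was matched in the 1st group of parentheses, which in this case is the (\w+)

“+”: Match whatever it's placed after 1 or more times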
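
As a rough check of the explanation above, the sketch below runs the pattern with and without the \b anchors on the “My thesis is great” example, and then on a word repeated three times to show the effect of the trailing “+”. The variable names are illustrative only.

```python
import re

no_boundaries = r'(\w+)(?:\W+\1)+'          # same pattern with the \b anchors removed
with_boundaries = r'\b(\w+)(?:\W+\1\b)+'    # pattern from this answer

text = 'My thesis is great'
# Without \b, the "is" ending "thesis" pairs up with the next word.
print(re.sub(no_boundaries, r'\1', text, flags=re.IGNORECASE))    # My thesis great
# With \b, only whole words are compared, so the text is left alone.
print(re.sub(with_boundaries, r'\1', text, flags=re.IGNORECASE))  # My thesis is great

# The + after the non-capturing group collapses a run of any length in one match.
print(re.sub(with_boundaries, r'\1', 'be be be removed', flags=re.IGNORECASE))  # be removed
```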

Credits:

I adapted this code to Python, but it originates from a geeksforgeeks.org post.
