Remove adjacent duplicate words in a string with Python?

Question

How would I remove adjacent duplicate words in a string? For example: 'Hey there There' -> 'Hey there'

stackoverflow.com/questions/7794208/… if you want no duplicate words at all. Or do you only want to remove adjacent duplicates?
– ChrisOram, Commented Jul 22, 2021 at 7:57

Tim Biegeleisen · Accepted Answer · 2021-07-22 08:01:07Z

10

Using re.sub with a backreference we can try:

import re

inp = 'Hey there There'
output = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

The regex pattern used here says to:

(\w+)  match and capture a word
[ ]    followed by a space
\1     then followed by the same word (ignoring case)

Then, we just replace with the first adjacent word.
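As a rough illustration of what the pattern captures (a sketch, not part of the original answer; the variable name match is chosen here for illustration), re.search can be used to inspect the match before any substitution happens:

```python
import re

inp = 'Hey there There'

# Find the first place where a word is followed by a space and the same word.
match = re.search(r'(\w+) \1', inp, flags=re.IGNORECASE)

print(match.group(0))  # there There  (the whole matched text)
print(match.group(1))  # there        (the captured first word)
```

re.sub replaces the whole matched text (group 0) with the captured word (group 1), which is why only the first of the two words survives.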

edited Jul 22, 2021 at 8:01

answered Jul 22, 2021 at 7:58

Tim Biegeleisen



6 Comments

user1655130 Over a year ago

What does r mean above?

Tim Biegeleisen Over a year ago

@user1655130 An r prefix on a Python string literal makes it a raw string, in which backslashes are kept as literal characters. We use raw strings for regex patterns so that sequences such as \w and \1 do not need doubled backslashes.
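A small sketch of the difference, assuming the same input as the answer (the variable names raw and escaped are illustrative only):

```python
import re

# In a normal string, '\b' is a single backspace character;
# in a raw string, r'\b' is a backslash followed by the letter b.
print(len('\b'), len(r'\b'))  # 1 2

# A raw string is equivalent to doubling every backslash by hand.
print(r'\1' == '\\1')  # True

inp = 'Hey there There'
raw = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
escaped = re.sub('(\\w+) \\1', '\\1', inp, flags=re.IGNORECASE)
print(raw == escaped)  # True
```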

user1655130 Over a year ago

from a learning perspective - how would you do this with recursion?

Tim Biegeleisen Over a year ago

I suggest opening a new question, as using some kind of recursive approach is very different from my current answer (but maybe I can post another answer).

user1655130 Over a year ago

Unfortunately, it won't let me ask a similar question. Thanks for your help.


ROHIT SHARMA 16110141 · Accepted Answer · 2021-09-08 11:05:25Z

3

import re

inp = 'Hey there There'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

inp = 'Hey there eating?'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there eating?

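\b asserts a word boundary, so the pattern matches whole words rather than fragments of words. The second test case ("Hey there eating?") is handled incorrectly by the answer given by Tim Biegeleisen (https://stackoverflow.com/a/68481181/8439676): it matches the trailing "e" of "there" against the leading "e" of "eating" and produces "Hey thereating?".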
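The difference can be checked directly. The sketch below runs both patterns on the same inputs; the function names without_boundaries and with_boundaries are illustrative only.

```python
import re

def without_boundaries(text):
    # Pattern from the first answer: no \b anchors.
    return re.sub(r'(\w+) \1', r'\1', text, flags=re.IGNORECASE)

def with_boundaries(text):
    # Pattern from this answer: \b on both sides forces whole-word matches.
    return re.sub(r'\b(\w+) \1\b', r'\1', text, flags=re.IGNORECASE)

for text in ('Hey there There', 'Hey there eating?'):
    print(repr(text), '->', repr(without_boundaries(text)), '|', repr(with_boundaries(text)))
```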

edited Sep 8, 2021 at 11:05

answered Sep 7, 2021 at 9:06

ROHIT SHARMA 16110141


Farhad Kabir · Accepted Answer · 2022-11-03 19:48:39Z

0

Remove adjacent duplicate words recursively

def removeConsecutiveDuplicateWords(s):
    # Split on whitespace; the comparison below is case-sensitive.
    st = s.split()
    if len(st) < 2:
        return " ".join(st)
    if st[0] != st[1]:
        # Keep the first word and recurse on the rest.
        return st[0] + " " + removeConsecutiveDuplicateWords(" ".join(st[1:]))
    # The first two words are equal: drop the first one and recurse.
    return removeConsecutiveDuplicateWords(" ".join(st[1:]))


string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
print(removeConsecutiveDuplicateWords(string))

Output: I am a duplicate word in a sentence. How I can be removed?
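The function above compares words case-sensitively, so it would not reduce the question's 'Hey there There' to 'Hey there'. A possible variant, sketched here under the assumption that case should be ignored and the first word of each run kept, recurses on a list of words instead of re-joining the string at every step. The name dedupe_adjacent_words is hypothetical.

```python
def dedupe_adjacent_words(text):
    """Collapse runs of adjacent duplicate words, ignoring case."""

    def helper(words):
        # Base case: zero or one word cannot contain a duplicate pair.
        if len(words) < 2:
            return words
        if words[0].lower() == words[1].lower():
            # Drop the second word of the pair and re-check the first
            # word against whatever follows it.
            return helper([words[0]] + words[2:])
        # The first word is settled; recurse on the remainder.
        return [words[0]] + helper(words[1:])

    return " ".join(helper(text.split()))


print(dedupe_adjacent_words('Hey there There'))              # Hey there
print(dedupe_adjacent_words('How I can be be be removed?'))  # How I can be removed?
```

The recursion depth grows with the number of words, so for very long strings Python's recursion limit makes an iterative or regex-based approach the safer choice.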

answered Nov 3, 2022 at 19:48

Farhad Kabir


H3lix · Accepted Answer · 2023-03-10 16:23:36Z

Rohit Sharma's answer should be accepted, as it does in fact take word boundaries into account. The original answer would incorrectly change 'Hey there eating' to 'Hey thereating'.

Alternatively, one could use the following regex (which will produce a slightly different output in some scenarios; see examples below):

my_output = re.sub(r'\b(\w+)(?:\W+\1\b)+', r'\1', my_input, flags=re.IGNORECASE)

Example 1:

INPUT: Buying food food in the supermarket

ROHIT'S VERSION OUTPUT: Buying food in the supermarket

ABOVE VERSION OUTPUT: Buying food in the supermarket

Example 2:

INPUT: Food: Food and Beverages

ROHIT'S VERSION OUTPUT: Food: Food and Beverages (unchanged)

ABOVE VERSION OUTPUT: Food and Beverages

Explanation:

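“\b”: A word boundary. Boundaries are needed for special cases. For example, in “My thesis is great”, the “is” at the end of “thesis” is not treated as a duplicate of the following “is”.

“\w+”: One or more word characters. \w is [a-zA-Z0-9_] for ASCII text; for Python 3 str patterns it also matches Unicode letters and digits by default.

“\W+”: One or more non-word characters: [^\w]

“\1”: Matches whatever was matched in the 1st group of parentheses, which in this case is the (\w+)

“+”: Match whatever it's placed after 1 or more times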
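To see where the two patterns diverge, the sketch below applies both to the examples above and to a run of three repeated words. The constant names ROHIT_PATTERN and RUN_PATTERN are illustrative, not taken from either answer.

```python
import re

# Pattern from Rohit Sharma's answer: exactly one space between the pair.
ROHIT_PATTERN = re.compile(r'\b(\w+) \1\b', flags=re.IGNORECASE)
# Pattern from this answer: any non-word separator, one or more repeats.
RUN_PATTERN = re.compile(r'\b(\w+)(?:\W+\1\b)+', flags=re.IGNORECASE)

samples = [
    'Buying food food in the supermarket',
    'Food: Food and Beverages',
    'How I can be be be removed?',
    'My thesis is great',
]

# Each entry is (result of ROHIT_PATTERN, result of RUN_PATTERN).
results = [
    (ROHIT_PATTERN.sub(r'\1', s), RUN_PATTERN.sub(r'\1', s))
    for s in samples
]

for sample, (pair_result, run_result) in zip(samples, results):
    print(repr(sample), '->', repr(pair_result), '|', repr(run_result))
```

In a single re.sub pass the space-only pattern collapses one pair at a time, so a run of three words ('be be be') is left as 'be be', while the (?:...)+ group consumes the whole run and leaves a single 'be'.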

Credits:

I adapted this code to Python but it originates from this geeksforgeeks.org post
