44

Consider the following example:

string1 = "calvin klein design dress calvin klein"

How can I remove the second occurrences of the duplicates "calvin" and "klein"?

The result should look like

string2 = "calvin klein design dress"

Only the later duplicates should be removed, and the order of the words must not change!

17 Answers

59
string1 = "calvin klein design dress calvin klein"
words = string1.split()
print (" ".join(sorted(set(words), key=words.index)))

This sorts the set of all the (unique) words in your string by the word's index in the original list of words.



27
def unique_list(l):
    ulist = []
    [ulist.append(x) for x in l if x not in ulist]
    return ulist

a="calvin klein design dress calvin klein"
a=' '.join(unique_list(a.split()))

7 Comments

Unfortunately it's O(N²) – the in check scans the whole ulist each time. Don't use it for long lists.
Thanks Pablo. I found that list comprehension part about 2 years ago on SO itself. I've been using it ever since.
@Petr: That's true. I provided it here under the assumption that the list is not going to be too long.
I find your use of append in a list comprehension disturbing.
A list comprehension is inappropriate and should not be used unless you're using the output. Use a proper loop: for x in l: if x not in ulist: ulist.append(x).
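As the last comment suggests, the side-effecting comprehension can be rewritten as a plain loop with the same behavior; a minimal sketch:

```python
def unique_list(l):
    """Return the elements of l with duplicates removed, preserving first-seen order."""
    ulist = []
    for x in l:
        if x not in ulist:  # O(n) membership test; fine for short lists
            ulist.append(x)
    return ulist

a = "calvin klein design dress calvin klein"
print(' '.join(unique_list(a.split())))  # → calvin klein design dress
```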
13

In Python 2.7+, you could use collections.OrderedDict for this:

from collections import OrderedDict
s = "calvin klein design dress calvin klein"
print ' '.join(OrderedDict((w,w) for w in s.split()).keys())

1 Comment

' '.join(OrderedDict.fromkeys(s.split())).
8

Cut and paste from the itertools recipes

from itertools import ifilterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
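The recipe above is written for Python 2 (ifilterfalse). A sketch of the same recipe for Python 3, where the function is named itertools.filterfalse:

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    seen = set()
    seen_add = seen.add  # bind the method once for speed
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

print(" ".join(unique_everseen("calvin klein design dress calvin klein".split())))
# → calvin klein design dress
```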

I really wish they could go ahead and make a module out of those recipes soon. I'd very much like to be able to do from itertools_recipes import unique_everseen instead of using cut-and-paste every time I need something.

Use like this:

def unique_words(string, ignore_case=False):
    key = None
    if ignore_case:
        key = str.lower
    return " ".join(unique_everseen(string.split(), key=key))

string2 = unique_words(string1)

3 Comments

I timed a few of these… this one is very fast, even for long lists.
@lazyr: As for your wish, it turns out you can do exactly that. Just install the package from PyPI.
@Petr This news does not surprise me in the slightest. I'd be amazed if there weren't a PyPI package for just that. What I meant was that it should be part of the included batteries in Python, since these recipes are used so frequently. I'm rather puzzled as to why they're not.
7
string2 = ' '.join(set(string1.split()))

Explanation:

.split() - a method that splits a string into a list (with no arguments it splits on whitespace)
set() - an unordered collection type that excludes duplicates
'separator'.join(list) - joins the elements of the list into a string, with 'separator' between elements

3 Comments

While this might answer the authors question, it lacks some explaining words and/or links to documentation. Raw code snippets are not very helpful without some phrases around them. You may also find how to write a good answer very helpful. Please edit your answer.
This potentially changes the order of the words in the string.
This will not remove duplicates if you want to split on something other than a space, e.g. for "cisco, cisco systems, cisco", splitting on spaces treats "cisco," and "cisco" as distinct tokens, so the output may be 'cisco, systems cisco'.
5
string = 'calvin klein design dress calvin klein'

def uniquify(string):
    output = []
    seen = set()
    for word in string.split():
        if word not in seen:
            output.append(word)
            seen.add(word)
    return ' '.join(output)

print(uniquify(string))

Comments

2

You can use a set to keep track of already processed words.

words = set()
result = ''
for word in string1.split():
    if word not in words:
        result = result + word + ' '
        words.add(word)
print(result.strip())  # strip the trailing space

2 Comments

Note that set is a built-in type. No need to import it (unless you use an ancient version of Python).
You should make result a list, append the words to it, and then return " ".join(result) in the end. This is much more efficient.
2

Several answers are pretty close to this but haven't quite ended up where I did:

def uniques(your_string):
    seen = set()
    # set.add returns None, so `seen.add(i) or i` adds i to seen and evaluates to i
    return ' '.join(seen.add(i) or i for i in your_string.split() if i not in seen)

Of course, if you want it a tiny bit cleaner or faster, we can refactor a bit:

def uniques( your_string ):    
    words = your_string.split()

    seen = set()
    seen_add = seen.add

    def add(x):
        seen_add(x)  
        return x

    return ' '.join( add(i) for i in words if i not in seen )

I think the second version is about as performant as you can get in a small amount of code. (More code could be used to do all the work in a single scan across the input string but for most workloads, this should be sufficient.)


1

Question: Remove the duplicates in a string

from collections import OrderedDict

a = "Gina Gini Gini Protijayi"

aa = OrderedDict.fromkeys(a.split())
print(' '.join(aa))
# output => Gina Gini Protijayi

1 Comment

Starting from Python 3.7, insertion order is guaranteed in dicts. So no need for OrderedDict.
1

You can use NumPy. Import it, conventionally aliased as np:

import numpy as np

Then you can remove duplicates from an array like this:

no_duplicates_array = np.unique(your_array)

For your case, if you want the result as a string, you can use:

no_duplicates_string = ' '.join(np.unique(your_string.split()))
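Note that np.unique returns its results sorted, so this does not preserve the original word order. A sketch that restores the order using the return_index argument:

```python
import numpy as np

words = np.array("calvin klein design dress calvin klein".split())
# return_index gives the position of the first occurrence of each unique word
_, first_idx = np.unique(words, return_index=True)
# re-sorting those positions recovers the original word order
result = " ".join(words[np.sort(first_idx)])
print(result)  # → calvin klein design dress
```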


1

To remove duplicate words from a sentence while preserving the order of the words, you can use the dict.fromkeys method.

string1 = "calvin klein design dress calvin klein"

words = string1.split()

result = " ".join(list(dict.fromkeys(words)))

print(result)


0

Answers 1 and 2 work perfectly. For example:

    s = "the sky is blue very blue"
    s = s.lower()
    slist = s.split()
    print(" ".join(sorted(set(slist), key=slist.index)))

1 Comment

How does this key argument work? I couldn't find it in the documentation.
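key is a standard parameter of sorted (and list.sort): a one-argument function applied to each element to produce its sort value. Here slist.index returns the position of a word's first occurrence, so sorting the unique words by it restores the original order. A quick demonstration:

```python
words = "the sky is blue very blue".split()
# words.index(w) is the position of w's first occurrence in words
# sorting the set of unique words by those positions recovers the original order
print(sorted(set(words), key=words.index))
# → ['the', 'sky', 'is', 'blue', 'very']
```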
0

You can remove duplicate or repeated words from a text file or string using the following code:

import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

# all_words and lemmatize_sentence are assumed to be defined elsewhere
new_data = []
for lines in all_words:
    line = ''.join(lines.lower())
    new_data1 = ' '.join(lemmatize_sentence(line))
    new_data2 = word_tokenize(new_data1)
    new_data3 = nltk.pos_tag(new_data2)

    # below code is for removal of repeated words
    for i in range(0, len(new_data3)):
        new_data3[i] = "".join(new_data3[i])
    UniqW = Counter(new_data3)
    new_data5 = " ".join(UniqW.keys())
    print(new_data5)

    new_data.append(new_data5)

print(new_data)

P.S. - Adjust the indentation as required. Hope this helps!


0

Without using the split function (will help in interviews)

def unique_words2(a):
    words = []
    spaces = ' '
    length = len(a)
    i = 0
    while i < length:
        if a[i] not in spaces:
            word_start = i
            while i < length and a[i] not in spaces:
                i += 1
            words.append(a[word_start:i])
        i += 1
    words_stack = []
    # The loop below could be collapsed into a comprehension:
    # [words_stack.append(val) for val in words if val not in words_stack]
    for val in words:
        if val not in words_stack:
            words_stack.append(val)
    print(' '.join(words_stack))  # or return, your choice


unique_words2('calvin klein design dress calvin klein') 


0

Initializing the list:

listA = ['xy-xy', 'pq-qr', 'xp-xp-xp', 'dd-ee']

print("Given list : ", listA)

Using set() and split():

res = [set(sub.split('-')) for sub in listA]

Result:

print("List after duplicate removal :", res)

1 Comment

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
0
import re

# Path to your file
file_path = r"g:\Pyton+ChatGPT\dictionar_no_duplicates.txt"

# Read the file contents
with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

# Remove duplicate words (the lookahead drops a word if it appears again
# later on the same line, so the last occurrence of each word is kept)
result = re.sub(r'\b(\w+)\b(?=.*\b\1\b)', '', text)

# Remove extra spaces or consecutive commas
result = re.sub(r'\s+', ' ', result).strip().replace(" ,", ",")

# Rewrite the file with the deduplicated content
with open(file_path, "w", encoding="utf-8") as file:
    file.write(result)

Or this:

def remove_duplicates(words):
    words_stack = []
    for val in words:
        if val not in words_stack:
            words_stack.append(val)
    return words_stack

input_file = r'g:\Pyton+ChatGPT\dictionar.txt'
output_file = r'g:\Pyton+ChatGPT\dictionar_no_duplicates.txt'

with open(input_file, 'r', encoding='utf-8') as f:
    words = f.read().splitlines()

unique_words = remove_duplicates(words)

with open(output_file, 'w', encoding='utf-8') as f:
    for word in unique_words:
        f.write(word + '\n')

print("Duplicate removal completed.")

Or this:

import re

# Path to your file
file_path = r"g:\Pyton+ChatGPT\dictionar_no_duplicates.txt"

# Read the file contents
with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

# Create a list for the removed words
removed_words = []

# Callback function that collects the duplicate words in the list
def replace_and_collect(match):
    word = match.group(1)
    if word not in removed_words:
        removed_words.append(word)
    return ''

# Remove duplicate words and the associated comma using the callback
result = re.sub(r'\b(\w+)\b,?(?=.*\b\1\b)', replace_and_collect, text)

# Remove extra spaces or consecutive commas
result = re.sub(r'\s+', ' ', result).strip().replace(" ,", ",").strip(", ")

# Rewrite the file with the deduplicated content
with open(file_path, "w", encoding="utf-8") as file:
    file.write(result)

# Print information about the removed words
print(f"Number of duplicate words removed: {len(removed_words)}")
print(f"Removed words: {', '.join(removed_words)}")


-1

You can do this by building the set associated with the string, which by definition contains no repeated elements, then sorting the unique words by their original index so the word order is preserved, and finally joining them back into a string:

def remove_duplicate_words(string):
    x = string.split()
    x = sorted(set(x), key=x.index)
    return ' '.join(x)

3 Comments

While this might answer the authors question, it lacks some explaining words and/or links to documentation. Raw code snippets are not very helpful without some phrases around them. You may also find how to write a good answer very helpful. Please edit your answer.
This potentially changes the order of the words in the string.
Thanks @parvus I have modified my answer
