
I have a pandas df where each row contains a list of words. The lists have duplicate words, which I want to remove.

I tried using dict.fromkeys(listname) in a for loop to iterate over each row in the df, but this splits the words into individual characters.

filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')

df["newlist"] = df["text_lemmatized"]
for i in range(0,len(df)):
    l = df["text_lemmatized"][i]
    df["newlist"][i] = list(dict.fromkeys(l))

print(df)

Expected result is ==>

['clear', 'pending', 'order', 'pending', 'order']   ['clear', 'pending', 'order']
 ['pending', 'activation', 'clear', 'pending']   ['pending', 'activation', 'clear']

Actual result is

['clear', 'pending', 'order', 'pending', 'order']  ...   [[, ', c, l, e, a, r, ,,  , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ...  ...  [[, ', p, e, n, d, i, g, ,,  , a, c, t, v, o, ...
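A minimal sketch (with an inline string standing in for the CSV cell, since the actual file isn't available here) reproduces the symptom: read_csv stores each list as its string representation, and dict.fromkeys then iterates over the string's characters:

```python
import pandas as pd

# After read_csv, each cell is the *string* "['clear', 'pending', ...]",
# not a list, so dict.fromkeys iterates over its characters.
df = pd.DataFrame(
    {"text_lemmatized": ["['clear', 'pending', 'order', 'pending', 'order']"]}
)

cell = df["text_lemmatized"][0]
print(type(cell))                     # <class 'str'>
print(list(dict.fromkeys(cell))[:5])  # ['[', "'", 'c', 'l', 'e']
```

This matches the actual result above: the bracket, quote, and letters are the first unique characters of the string.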

6 Answers


Use set to remove duplicates. You also don't need the for loop:

  df["newlist"] = list(set( df["text_lemmatized"] ))

3 Comments

I get the error...ValueError: Length of values does not match length of index
What error? You need to provide at least the error message
Error I get is "ValueError: Length of values does not match length of index "
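For context on the error reported in these comments: set() applied to the whole column collapses it to one entry per unique row value, and assigning that shorter list back to the DataFrame raises exactly this ValueError. A small sketch with made-up string data:

```python
import pandas as pd

df = pd.DataFrame({"text_lemmatized": ["a b a", "c d c", "a b a"]})

# One entry per unique *row value* -- 2 entries for 3 rows.
unique_rows = list(set(df["text_lemmatized"]))
print(len(unique_rows), len(df))

# Assigning it back raises the reported error:
try:
    df["newlist"] = unique_rows
except ValueError as e:
    print(e)  # Length of values ... does not match length of index
```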

Just use Series.map and np.unique:

Your sample data:

Out[43]:
                           text_lemmatized
0  [clear, pending, order, pending, order]
1    [pending, activation, clear, pending]

df.text_lemmatized.map(np.unique)

Out[44]:
    0         [clear, order, pending]
    1    [activation, clear, pending]
    Name: val, dtype: object

If you prefer the result not be sorted (np.unique sorts), use pd.unique, which preserves the original order:

df.text_lemmatized.map(pd.unique)

Out[51]:
0         [clear, pending, order]
1    [pending, activation, clear]
Name: text_lemmatized, dtype: object
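The difference between the two, shown on a plain list: np.unique returns a sorted array, while pd.unique keeps first-seen order:

```python
import numpy as np
import pandas as pd

words = ['clear', 'pending', 'order', 'pending', 'order']

print(list(np.unique(words)))  # ['clear', 'order', 'pending'] -- sorted
print(list(pd.unique(words)))  # ['clear', 'pending', 'order'] -- original order
```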

2 Comments

There is no error but it does not remove the duplicates. Does not work.
@AnoopMahajan: that's weird! It works on my system with your sample. Have you assigned it back, as in df['newlist'] = df.text_lemmatized.map(pd.unique)?
df.drop_duplicates(subset="text_lemmatized",
                   keep='first', inplace=True)

keep='first' means keep the first occurrence.
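Worth noting why this can't produce the expected output: drop_duplicates compares whole rows, so it removes repeated rows rather than repeated words inside a cell. A sketch with hypothetical string data:

```python
import pandas as pd

df = pd.DataFrame({"text_lemmatized": ["clear pending", "clear pending", "order"]})

# drop_duplicates removes *rows* whose cell values repeat...
deduped = df.drop_duplicates(subset="text_lemmatized", keep="first")
print(len(deduped))  # 2 rows left

# ...it does not touch duplicate words *within* a cell.
print(deduped["text_lemmatized"].tolist())  # ['clear pending', 'order']
```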

1 Comment

This does not work and gives me the same error: "ValueError: Length of values does not match length of index"...

Your code for removing duplicates seems fine. I tried the following and it worked well. I guess the problem is the way you are appending the list to the dataframe column.

list_from_df = [['clear', 'pending', 'order', 'pending', 'order'],
                ['pending', 'activation', 'clear', 'pending']]

list_with_unique_words = []

for x in list_from_df:
    unique_words = list(dict.fromkeys(x))
    list_with_unique_words.append(unique_words)

print(list_with_unique_words)

output: [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]

df["newlist"] = list_with_unique_words

df


1 Comment

This does not seem to work and gives me the same error as before, where it splits each word into individual characters

The problem is that these are not lists but strings, so it is necessary to convert each value to a list with ast.literal_eval; then the values can be converted to sets to remove duplicates:

import ast

df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(ast.literal_eval(x))))
print(df)
                           text_lemmatized                       newlist
0  [clear, pending, order, pending, order]       [clear, pending, order]
1    [pending, activation, clear, pending]  [clear, activation, pending]

Or use dict.fromkeys:

f = lambda x: list(dict.fromkeys(ast.literal_eval(x)))
df['newlist'] = df['text_lemmatized'].map(f)

Another idea is to convert column text_lemmatized to lists in one step and then remove duplicates in another step; the advantage is that text_lemmatized then holds real lists for further processing:

df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
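Putting the two steps together, a self-contained sketch (inline strings standing in for the CSV contents) that also uses dict.fromkeys so the word order matches the expected output:

```python
import ast

import pandas as pd

# Inline stand-in for the CSV: the column holds string representations of lists.
df = pd.DataFrame({"text_lemmatized": [
    "['clear', 'pending', 'order', 'pending', 'order']",
    "['pending', 'activation', 'clear', 'pending']",
]})

# Step 1: parse each string back into a real list.
df["text_lemmatized"] = df["text_lemmatized"].map(ast.literal_eval)

# Step 2: drop duplicates; dict.fromkeys keeps first-seen order, unlike set.
df["newlist"] = df["text_lemmatized"].map(lambda x: list(dict.fromkeys(x)))

print(df["newlist"].tolist())
# [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]
```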

EDIT:

After some discussion, the solution is:

df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
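One caveat: set does not guarantee iteration order, so newlist may come out shuffled relative to the ordered expected output; dict.fromkeys keeps first-seen order, as a quick check shows:

```python
words = ['pending', 'activation', 'clear', 'pending']

# set removes duplicates, but its iteration order is arbitrary:
print(list(set(words)))            # same words, order not guaranteed

# dict.fromkeys removes duplicates and keeps first-seen order:
print(list(dict.fromkeys(words)))  # ['pending', 'activation', 'clear']
```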

8 Comments

I am sorry jezrael... When I incorporated this into my full code, it now gives me an error: "ValueError: malformed node or string: ['clear', 'pending', 'order', 'pending', 'order']"
@AnoopMahajan - How does the last solution work if you change df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval) to df['text_lemmatized'] = df['text_lemmatized'].str.strip("[]").str.split(', ')?
Tried the following: df["newlist"] = df["text_lemmatized"]; df['text_lemmatized'] = df['text_lemmatized'].str.strip("[]").str.split(', '); df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))... I get an error: "AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas"
@AnoopMahajan - What is print (type(df['text_lemmatized'].iat[0])) ?
print returns <class 'list'>

Solution is ==>

import pandas as pd
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df)

Thanks to jezrael and all others who helped narrow down to this solution
