
I have a pandas df where each row contains a list of words. The lists have duplicate words, which I want to remove.

I tried using dict.fromkeys(listname) in a for loop to iterate over each row in the df, but this splits the words into individual characters.

filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')

df["newlist"] = df["text_lemmatized"]
for i in range(0,len(df)):
    l = df["text_lemmatized"][i]
    df["newlist"][i] = list(dict.fromkeys(l))

print(df)

Expected result is ==>

['clear', 'pending', 'order', 'pending', 'order']   ['clear', 'pending', 'order']
 ['pending', 'activation', 'clear', 'pending']   ['pending', 'activation', 'clear']

Actual result is

['clear', 'pending', 'order', 'pending', 'order']  ...   [[, ', c, l, e, a, r, ,,  , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ...  ...  [[, ', p, e, n, d, i, g, ,,  , a, c, t, v, o, ...
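A minimal sketch (with an inline string standing in for the CSV cell, since the actual file isn't available here) reproduces the symptom: read_csv stores each list as its string representation, and dict.fromkeys then iterates over the string's characters:

```python
import pandas as pd

# After read_csv, each cell is the *string* "['clear', 'pending', ...]",
# not a list, so dict.fromkeys iterates over its characters.
df = pd.DataFrame(
    {"text_lemmatized": ["['clear', 'pending', 'order', 'pending', 'order']"]}
)

cell = df["text_lemmatized"][0]
print(type(cell))                     # <class 'str'>
print(list(dict.fromkeys(cell))[:5])  # ['[', "'", 'c', 'l', 'e']
```

This matches the actual result above: the bracket, quote, and letters are the first unique characters of the string.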

6 Answers


Use set to remove duplicates. You also don't need the for loop:

  df["newlist"] = list(set( df["text_lemmatized"] ))

3 Comments

I get the error...ValueError: Length of values does not match length of index
What error? You need to provide at least the error message
Error I get is "ValueError: Length of values does not match length of index "
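For context on the error reported in these comments: set() applied to the whole column collapses it to one entry per unique row value, and assigning that shorter list back to the DataFrame raises exactly this ValueError. A small sketch with made-up string data:

```python
import pandas as pd

df = pd.DataFrame({"text_lemmatized": ["a b a", "c d c", "a b a"]})

# One entry per unique *row value* -- 2 entries for 3 rows.
unique_rows = list(set(df["text_lemmatized"]))
print(len(unique_rows), len(df))

# Assigning it back raises the reported error:
try:
    df["newlist"] = unique_rows
except ValueError as e:
    print(e)  # Length of values ... does not match length of index
```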

Just use Series.map and np.unique:

Your sample data:

Out[43]:
                           text_lemmatized
0  [clear, pending, order, pending, order]
1    [pending, activation, clear, pending]

df.text_lemmatized.map(np.unique)

Out[44]:
    0         [clear, order, pending]
    1    [activation, clear, pending]
    Name: val, dtype: object

If you prefer the result not be sorted (np.unique sorts), use pd.unique, which preserves the original order:

df.text_lemmatized.map(pd.unique)

Out[51]:
0         [clear, pending, order]
1    [pending, activation, clear]
Name: text_lemmatized, dtype: object
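The difference between the two, shown on a plain list: np.unique returns a sorted array, while pd.unique keeps first-seen order:

```python
import numpy as np
import pandas as pd

words = ['clear', 'pending', 'order', 'pending', 'order']

print(list(np.unique(words)))  # ['clear', 'order', 'pending'] -- sorted
print(list(pd.unique(words)))  # ['clear', 'pending', 'order'] -- original order
```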

2 Comments

There is no error but it does not remove the duplicates. Does not work.
@AnoopMahajan: that's weird! It works on my system with your sample. Have you assigned it back, as in df['newlist'] = df.text_lemmatized.map(pd.unique)?
df.drop_duplicates(subset="text_lemmatized",
                   keep='first', inplace=True)

keep='first' means keep the first occurrence.
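Worth noting why this can't produce the expected output: drop_duplicates compares whole rows, so it removes repeated rows rather than repeated words inside a cell. A sketch with hypothetical string data:

```python
import pandas as pd

df = pd.DataFrame({"text_lemmatized": ["clear pending", "clear pending", "order"]})

# drop_duplicates removes *rows* whose cell values repeat...
deduped = df.drop_duplicates(subset="text_lemmatized", keep="first")
print(len(deduped))  # 2 rows left

# ...it does not touch duplicate words *within* a cell.
print(deduped["text_lemmatized"].tolist())  # ['clear pending', 'order']
```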

1 Comment

This does not work and gives me the same error: "ValueError: Length of values does not match length of index"...

Your code for removing duplicates seems fine. I tried the following and it worked well. I guess the problem is the way you are appending the list to the dataframe column.

list_from_df = [['clear', 'pending', 'order', 'pending', 'order'],
                ['pending', 'activation', 'clear', 'pending']]

list_with_unique_words = []

for x in list_from_df:
    unique_words = list(dict.fromkeys(x))
    list_with_unique_words.append(unique_words)

print(list_with_unique_words)

output: [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]

df["newlist"] = list_with_unique_words

df


1 Comment

This does not seem to work and gives me the same error as before, where it splits each word into individual characters

The problem is that these are not lists but strings, so it is necessary to convert each value to a list with ast.literal_eval; then the values can be converted to sets to remove duplicates:

import ast

df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(ast.literal_eval(x))))
print(df)
                           text_lemmatized                       newlist
0  [clear, pending, order, pending, order]       [clear, pending, order]
1    [pending, activation, clear, pending]  [clear, activation, pending]

Or use dict.fromkeys:

f = lambda x: list(dict.fromkeys(ast.literal_eval(x)))
df['newlist'] = df['text_lemmatized'].map(f)

Another idea is to convert column text_lemmatized to lists in one step and then remove duplicates in another step; the advantage is that text_lemmatized then holds real lists for further processing:

df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
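Putting the two steps together, a self-contained sketch (inline strings standing in for the CSV contents) that also uses dict.fromkeys so the word order matches the expected output:

```python
import ast

import pandas as pd

# Inline stand-in for the CSV: the column holds string representations of lists.
df = pd.DataFrame({"text_lemmatized": [
    "['clear', 'pending', 'order', 'pending', 'order']",
    "['pending', 'activation', 'clear', 'pending']",
]})

# Step 1: parse each string back into a real list.
df["text_lemmatized"] = df["text_lemmatized"].map(ast.literal_eval)

# Step 2: drop duplicates; dict.fromkeys keeps first-seen order, unlike set.
df["newlist"] = df["text_lemmatized"].map(lambda x: list(dict.fromkeys(x)))

print(df["newlist"].tolist())
# [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]
```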

EDIT:

After some discussion, the solution is:

df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
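One caveat: set does not guarantee iteration order, so newlist may come out shuffled relative to the ordered expected output; dict.fromkeys keeps first-seen order, as a quick check shows:

```python
words = ['pending', 'activation', 'clear', 'pending']

# set removes duplicates, but its iteration order is arbitrary:
print(list(set(words)))            # same words, order not guaranteed

# dict.fromkeys removes duplicates and keeps first-seen order:
print(list(dict.fromkeys(words)))  # ['pending', 'activation', 'clear']
```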

8 Comments

I am sorry jezrael... When I incorporated this into my full code, it now gives me an error: "ValueError: malformed node or string: ['clear', 'pending', 'order', 'pending', 'order']"
@AnoopMahajan - How does the last solution work if you change df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval) to df['text_lemmatized'] = df['text_lemmatized'].str.strip("[]").str.split(', ')?
Tried the following: df["newlist"] = df["text_lemmatized"]; df['text_lemmatized'] = df['text_lemmatized'].str.strip("[]").str.split(', '); df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))... I get an error: "AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas"
@AnoopMahajan - What is print (type(df['text_lemmatized'].iat[0])) ?
print returns <class 'list'>

Solution is ==>

import pandas as pd
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df)

Thanks to jezrael and all others who helped narrow down to this solution
