Pandas drop_duplicates method not working on dataframe containing lists

Question

I am trying to use the drop_duplicates method on my dataframe, but I am getting the following error:

TypeError: unhashable type: 'list'

The code I am using:

df = db.drop_duplicates()

My DB is huge and contains strings, floats, dates, NaNs, booleans, integers... Any help is appreciated.

Apparently, it contains lists, which is what causes the error. Generally, I consider a DataFrame of lists to be a code smell...
– juanpa.arrivillaga, Commented May 8, 2017 at 19:08
I know this was a while ago, but would you care to elaborate on why a df containing a list is a code smell?
– KubiK888, Commented Apr 29, 2021 at 17:05

Pepino · Accepted Answer · 2019-06-06 13:10:48Z

71

drop_duplicates won't work with lists in your dataframe, as the error message implies: lists are unhashable. However, you can drop duplicates on a copy of the dataframe cast to str, and then select the rows from the original df using the index of the result.

Setup

import pandas as pd

df = pd.DataFrame({'Keyword': {0: 'apply', 1: 'apply', 2: 'apply', 3: 'terms', 4: 'terms'},
 'X': {0: [1, 2], 1: [1, 2], 2: 'xy', 3: 'xx', 4: 'yy'},
 'Y': {0: 'yy', 1: 'yy', 2: 'yx', 3: 'ix', 4: 'xi'}})

# Calling drop_duplicates directly raises the same error
df.drop_duplicates()
Traceback (most recent call last):
...
TypeError: unhashable type: 'list'
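If it is not obvious which columns hold the lists (the question describes a large frame with many mixed types), a quick scan can locate them. This is only a minimal sketch on a frame equivalent to the setup above; `columns_with_lists` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['apply', 'apply', 'apply', 'terms', 'terms'],
    'X': [[1, 2], [1, 2], 'xy', 'xx', 'yy'],
    'Y': ['yy', 'yy', 'yx', 'ix', 'xi'],
})

def columns_with_lists(frame):
    """Return the names of columns that contain at least one list value."""
    return [
        col for col in frame.columns
        if frame[col].map(lambda v: isinstance(v, list)).any()
    ]

# Hypothetical helper applied to the example frame.
list_columns = columns_with_lists(df)
print(list_columns)
```

Knowing the offending columns also makes it possible to pass only the remaining columns as `subset` if the list columns are irrelevant for deduplication.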

Solution

# Convert the df to str type, drop duplicates, and then select the rows from the original df.

df.loc[df.astype(str).drop_duplicates().index]
Out[205]: 
  Keyword       X   Y
0   apply  [1, 2]  yy
2   apply      xy  yx
3   terms      xx  ix
4   terms      yy  xi

# The list elements are still lists in the final result.
df.loc[df.astype(str).drop_duplicates().index].loc[0,'X']
Out[207]: [1, 2]

Edit: replaced iloc with loc. In this particular case both work because the index labels match the row positions, but that is not true in general.
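One caveat of comparing string representations: a list `[1, 2]` and the literal string `'[1, 2]'` look identical after `astype(str)`. If that distinction matters, one alternative (not part of the answer above) is to convert lists to tuples, which are hashable, and use the resulting duplicate mask to filter the original frame. A sketch under that assumption, with `to_hashable` as a hypothetical helper:

```python
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['apply', 'apply', 'apply'],
    'X': [[1, 2], [1, 2], '[1, 2]'],
})

def to_hashable(value):
    """Turn lists into tuples; leave every other value unchanged."""
    return tuple(value) if isinstance(value, list) else value

# Build a hashable copy only for comparison; the original frame is untouched.
hashable = df.apply(lambda col: col.map(to_hashable))
deduped = df[~hashable.duplicated()]

# The string comparison cannot tell the list from the look-alike string.
via_str = df[~df.astype(str).duplicated()]

print(len(deduped), len(via_str))
```

This sketch only handles flat lists; nested lists would need a recursive conversion.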

edited Jun 6, 2019 at 13:10

Pepino


answered May 8, 2017 at 19:36

Allen Qin


2 Comments

Madhi Over a year ago

@Allen How do you add the IPython code block on Stack Overflow? I have searched all over the internet for this but couldn't find a good solution.

PSK Over a year ago

This answer would not account for cases where two lists in the same column in different rows contain the same elements but in varying order. I guess it also depends on whether the user wants to treat lists with same elements but varying order as duplicates or not.
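If lists with the same elements in a different order should count as duplicates, as the comment above raises, the comparison key can be normalized before checking. A sketch, assuming the list elements can be sorted against each other; `order_insensitive` is a hypothetical helper name:

```python
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['apply', 'apply', 'terms'],
    'X': [[1, 2], [2, 1], 'xx'],
})

def order_insensitive(value):
    """Map a list to a sorted tuple so that element order is ignored."""
    return tuple(sorted(value)) if isinstance(value, list) else value

# The normalized copy is used only as a comparison key.
key = df.apply(lambda col: col.map(order_insensitive))
deduped = df[~key.duplicated()]
print(deduped)
```

The rows kept in `deduped` come from the original frame, so the surviving list keeps its original order.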

Hsgao · Accepted Answer · 2020-01-19 10:26:35Z

20

@Allen's answer is great, but it has a small problem.

df.iloc[df.astype(str).drop_duplicates().index]

It should be loc, not iloc. Look at this example.

a = pd.DataFrame([['a',18],['b',11],['a',18]],index=[4,6,8])
Out[52]: 
   0   1
4  a  18
6  b  11
8  a  18

a.iloc[a.astype(str).drop_duplicates().index]
Out[53]:
...
IndexError: positional indexers are out-of-bounds

a.loc[a.astype(str).drop_duplicates().index]
Out[54]: 
   0   1
4  a  18
6  b  11
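A variation that sidesteps the label-versus-position question is to filter with the boolean mask from `duplicated()` instead of looking rows up by index. This is offered as an alternative, not as a correction of the answers above. It also behaves sensibly when index labels repeat, a case where `loc` with a list of labels returns every row sharing a label. The frame below is made up for illustration.

```python
import pandas as pd

# Non-default index with a repeated label (4 appears twice).
a = pd.DataFrame([['a', [1]], ['b', [2]], ['a', [1]]], index=[4, 6, 4])

# True for the first occurrence of each row, False for later repeats.
keep = ~a.astype(str).duplicated()
deduped = a[keep]
print(deduped)

# Label lookup returns both rows labelled 4, so the duplicate is not dropped.
via_loc = a.loc[a.astype(str).drop_duplicates().index]
print(len(deduped), len(via_loc))
```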

edited Jan 19, 2020 at 10:26

answered Sep 6, 2018 at 7:18

Hsgao


Peter Erichsen · Accepted Answer · 2021-09-09 13:23:39Z

5

I also just want to mention (in case someone else is as stupid as I was) that you will get the same error if you mistakenly pass a list of lists as the 'subset' argument of the drop_duplicates function.

It turns out I spent hours looking for a list that wasn't in my dataframe, all because I put one too many brackets in my parameters.
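For reference, a sketch of the mistake described above, using a frame that contains no lists at all. The column names `a` and `b` are made up for illustration, and the behaviour shown is what recent pandas versions do; the exact exception may differ in older releases.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

# Correct: subset is a flat list of column labels.
ok = df.drop_duplicates(subset=['a', 'b'])

# Mistake: one pair of brackets too many, so subset itself contains a list.
try:
    df.drop_duplicates(subset=[['a', 'b']])
    nested_error = None
except TypeError as exc:
    nested_error = exc

print(len(ok), type(nested_error).__name__)
```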

answered Sep 9, 2021 at 13:23

Peter Erichsen


ListenSoftware Louise Ai Agent · Accepted Answer · 2020-08-11 21:02:33Z

1

Overview: before dropping anything, you can inspect which rows are duplicated.

Method 1:

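df2 = df.copy()

# Replace the two list cells (rows 0 and 1, column 1) with space-joined strings
# so that every value in the frame is hashable.
df2.iloc[0, 1] = ' '.join(map(str, df2.iloc[0, 1]))
df2.iloc[1, 1] = ' '.join(map(str, df2.iloc[1, 1]))

# keep=False marks every member of a duplicate group, not only the repeats.
duplicates = df2.duplicated(keep=False)
print(df2[duplicates])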

Method 2:

print(df.astype(str).duplicated(keep=False))

answered Aug 11, 2020 at 21:02

ListenSoftware Louise Ai Agent
