Remove duplicates from rows and columns (cell) in a dataframe, python

Question

I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:

Index   x    y  
  1     1    ec, us, us, gbr, lst
  2     5    ec, us, us, us, us, ec, ec, ec, ec
  3     8    ec, us, us, gbr, lst, lst, lst, lst, gbr
  4     5    ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
  5     7    chn, chn, chn, ec, ec, us, us, gbr, lst

I need to eliminate all the duplicate items and get a resulting dataframe like this:

Index   x    y  
  1     1    ec, us, gbr, lst
  2     5    ec, us
  3     8    ec, us, gbr, lst
  4     5    ec, us, ir
  5     7    chn, ec, us, gbr, lst

Thanks!!

So, what did you already try out in order to get the result you want?
– 1313e, Commented Jan 4, 2018 at 4:30
stackoverflow.com/questions/7794208/… There are multiple functions there; you just need to apply them to your dataframe.
– BENY, Commented Jan 4, 2018 at 4:55

kağan hazal koçdemir · Accepted Answer · 2021-09-15 09:40:31Z

21

Split each cell, apply set, and join the result back, i.e.:

df['y'].str.split(', ').apply(set).str.join(', ')

0         us, ec, gbr, lst
1                   us, ec
2         us, ec, gbr, lst
3               us, ec, ir
4    us, lst, ec, gbr, chn
Name: y, dtype: object
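
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt from the sample in the question, and the result column name `y_unique` is only an illustrative choice. Because a set has no defined order, the items in each cell may come out in a different order than in the output shown above.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "x": [1, 5, 8, 5, 7],
    "y": [
        "ec, us, us, gbr, lst",
        "ec, us, us, us, us, ec, ec, ec, ec",
        "ec, us, us, gbr, lst, lst, lst, lst, gbr",
        "ec, ec, ec, us, us, ir, us, ec, ir, ec, ec",
        "chn, chn, chn, ec, ec, us, us, gbr, lst",
    ],
})

# Split each cell into a list, drop duplicates with set, and join back
# (y_unique is a hypothetical column name used only for this example)
df["y_unique"] = df["y"].str.split(", ").apply(set).str.join(", ")
print(df)
```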

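Update based on the comments:

df['y'].str.replace(r'nan|[{}\s]', '', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(r',{2,}', ',', regex=True)

# Remove the braces, whitespace, and `nan` (replace them with `''`), then split, apply set, and join.
# Finally strip leading/trailing commas and collapse runs of commas left behind by the removed `nan` items.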
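
An alternative to the regex chain, sketched under the assumption that each cell is a string such as `{ec, us, us, gbr, lst, nan, nan}` (the format described in the comments): strip the braces, split on commas, and filter out empty and `nan` tokens before deduplicating. `clean_cell` is a hypothetical helper name, not part of pandas. Unlike the regex, it only drops tokens that are exactly `nan`, so a code that merely contains those letters is left alone, and it keeps the first-seen order.

```python
import pandas as pd

def clean_cell(cell):
    """Return the unique, non-empty, non-'nan' tokens of a '{a, b, ...}' string."""
    # Assumption: cells that are not strings (e.g. a real NaN) become ''
    if not isinstance(cell, str):
        return ""
    # Drop the surrounding braces, then split and trim each token
    tokens = (t.strip() for t in cell.strip("{} ").split(","))
    kept = [t for t in tokens if t and t != "nan"]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return ", ".join(dict.fromkeys(kept))

# Illustrative data in the format described in the comments
df = pd.DataFrame({"y": ["{ec, us, us, gbr, lst, nan, nan}", "{nan, us, ec, us}"]})
df["y"] = df["y"].apply(clean_cell)
print(df["y"].tolist())  # ['ec, us, gbr, lst', 'us, ec']
```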

edited Sep 15, 2021 at 9:40

kağan hazal koçdemir


answered Jan 4, 2018 at 4:34

Bharath M Shetty



5 Comments

PAstudilloE Over a year ago

It works perfectly, @Dark ... but I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it?

Bharath M Shetty Over a year ago

@PAstudilloE are you saying the y column is like {ec,us.. before running this code or after running this code?

PAstudilloE Over a year ago

before running the code. the original columns are {ec, us, ..., nan} @Dark

PAstudilloE Over a year ago

it works well. The only problem that I have now is that the results I'm getting are like this: , , us, ec... (the nan's are erased but the commas are still there). Do you have any guidance on how to solve that?

kağan hazal koçdemir Over a year ago

To avoid the FutureWarning, add regex=True to the replace call.

koPytok · Accepted Answer · 2018-01-04 04:37:34Z

1

Try this:

d['y'] = d['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))

answered Jan 4, 2018 at 4:37

koPytok


1 Comment

PAstudilloE Over a year ago

It works perfectly!... but I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it?

Hans Musgrave · Accepted Answer · 2018-01-04 05:33:21Z

1

If you don't care about item order, and assuming the data type of everything in column y is a string, you can use the following snippet:

df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))

The set() conversion is what removes duplicates. Note that sets are unordered in every version of Python (it is dicts, not sets, that preserve insertion order from 3.7 onward), so the items may come back in a different order.
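
If the original order does matter, one option is to deduplicate with `dict.fromkeys` instead of `set`, since dict keys are unique and keep insertion order as a language guarantee from Python 3.7. A minimal sketch, with `dedup_ordered` as a hypothetical helper name and two rows borrowed from the question as sample data:

```python
import pandas as pd

def dedup_ordered(s):
    # dict keys are unique and keep first-seen order
    return ", ".join(dict.fromkeys(s.split(", ")))

df = pd.DataFrame({
    "y": [
        "ec, us, us, gbr, lst",
        "chn, chn, chn, ec, ec, us, us, gbr, lst",
    ]
})
df["y"] = df["y"].apply(dedup_ordered)
print(df["y"].tolist())  # ['ec, us, gbr, lst', 'chn, ec, us, gbr, lst']
```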

edited Jan 4, 2018 at 5:33

answered Jan 4, 2018 at 4:36

Hans Musgrave


3 Comments

Turn Over a year ago

That call to list isn't needed.

PAstudilloE Over a year ago

I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it?

Peter O. Over a year ago

Even in Python 3.10, sets are documented as unordered collections, so they should not be used if the order in which items are inserted or enumerated is important to a program.

srj · Accepted Answer · 2018-01-04 04:44:46Z

0

Use the apply method on the dataframe.

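# change this function according to your needs
def dedup(row):
    # row.y is a comma-separated string, so split it into items first;
    # calling set() on the string itself would deduplicate single characters
    return list(set(row.y.split(', ')))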

df['deduped'] = df.apply(dedup, axis=1)

answered Jan 4, 2018 at 4:44

srj

