12

I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:

Index   x    y  
  1     1    ec, us, us, gbr, lst
  2     5    ec, us, us, us, us, ec, ec, ec, ec
  3     8    ec, us, us, gbr, lst, lst, lst, lst, gbr
  4     5    ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
  5     7    chn, chn, chn, ec, ec, us, us, gbr, lst

I need to eliminate all the duplicate items an get a resulting dataframe like this:

Index   x    y  
  1     1    ec, us, gbr, lst
  2     5    ec, us
  3     8    ec, us, gbr,lst
  4     5    ec, us, ir
  5     7    chn, ec, us, gbr, lst

Thanks!!

2
  • So, what did you already try out in order to get the result you want? Commented Jan 4, 2018 at 4:30
  • stackoverflow.com/questions/7794208/… mutiple function there, what you need is just apply those to your dataframe Commented Jan 4, 2018 at 4:55

4 Answers 4

21

Split and apply set and join i.e

df['y'].str.split(', ').apply(set).str.join(', ')

0         us, ec, gbr, lst
1                   us, ec
2         us, ec, gbr, lst
3               us, ec, ir
4    us, lst, ec, gbr, chn
Name: y, dtype: object

Update based on comment :

df['y'].str.replace('nan|[{}\s]','', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(",{2,}",",", regex=True)

# Replace all the braces and nan with `''`, then split and apply set and join
Sign up to request clarification or add additional context in comments.

5 Comments

it works perfect @Dark ... but I forgot to include that all the [y] column is like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase erase the {} and the nan. Do you know how to do it?
@PAstudilloE are you saying the y column is like {ec,us.. before running this code or after running this code?
before running the code. the original columns are {ec, us, ..., nan} @Dark
it works well. The only problem that I have now is that the results I'm getting are like this: , , us, ec... (the nan's are erased but the commas are still there). Do you have any guidance on how to solve that?
For FutureWarning error add regex=True in replace
1

Try this:

d['y'] = d['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))

1 Comment

it works perfect!... but I forgot to include that all the [y] column is like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase erase the {} and the nan. Do you know how to do it?
1

If you don't care about item order, and assuming the data type of everything in column y is a string, you can use the following snippet:

df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))

The set() conversion is what removes duplicates. I think in later versions of python it might preserve order (3.4+ maybe?), but that is an implementation detail rather than a language specification.

3 Comments

That call to list isn't needed.
I forgot to include that all the [y] column is like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase erase the {} and the nan. Do you know how to do it?
Even in Python 3.10, sets are documented as unordered collections, so they should not be used if the order in which items are inserted or enumerated is important to a program.
0

use the apply method on the dataframe.

# change this function according to your needs
def dedup(row):
    return list(set(row.y))

df['deduped'] = df.apply(dedup, axis=1)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.