Finding unique values in pandas column where each row has multiple values

Question

I have the following column in a dataframe, which contains colors separated by |:

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

I want to find all unique colors from the column.

Expected Output:

{'YELLOW', 'BLACK', 'RED', 'BLUE', 'BROWN', 'GREEN', 'WHITE', 'PINK'}

I don't mind if it is list or set.

What I tried:

df['x'] = df['x'].apply(lambda x: x.split("|"))

colors = []
for idx, row in df.iterrows():
    colors.extend(row['x'])

print(set(colors))

This works fine, but I am looking for a more efficient solution because I have a large dataset.

Okay, then you can check with itertools; I posted an answer using that.
– anky, Commented Mar 25, 2019 at 5:50

anky · Accepted Answer · 2019-03-25 05:58:21Z

1

Use itertools (arguably the fastest way to flatten lists) with set:

import itertools
set(itertools.chain.from_iterable(df.x.str.split('|')))

Output:

{'BLACK', 'BLUE', 'BROWN', 'GREEN', 'PINK', 'RED', 'WHITE', 'YELLOW'}

Another possible solution uses functools, which is almost as fast as itertools:

import functools
import operator
set(functools.reduce(operator.iadd, df.x.str.split('|'), []))

Note that you can also use sum(), which reads well but is not quite as fast, since each addition builds a new list.
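To make the comparison concrete, here is a minimal sketch that runs the three flattening approaches side by side on the sample data from the question. The variable names (parts, via_chain, via_reduce, via_sum) are made up for illustration, and no timings are claimed here; it only checks that all three give the same set.

```python
import functools
import itertools
import operator

import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

# Split each row into a list of colors; the result is a Series of lists.
parts = df['x'].str.split('|')

# itertools: chains the lists lazily, without building intermediate lists.
via_chain = set(itertools.chain.from_iterable(parts))

# functools + operator.iadd: extends a single accumulator list in place.
via_reduce = set(functools.reduce(operator.iadd, parts, []))

# sum(): creates a new list at every step, so the copying grows with row count.
via_sum = set(sum(parts, []))

assert via_chain == via_reduce == via_sum
print(sorted(via_chain))
# ['BLACK', 'BLUE', 'BROWN', 'GREEN', 'PINK', 'RED', 'WHITE', 'YELLOW']
```

If speed matters on your data, it is worth timing these on a realistic sample, since the gap depends on the number of rows and the length of each list.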

edited Mar 25, 2019 at 5:58

answered Mar 25, 2019 at 5:47

anky


bubble · Accepted Answer · 2019-03-25 05:34:50Z

1

set(df.loc[:, 'x'].str.split('|', expand=True).values.ravel())

or, to drop the None that pads the shorter rows:

set(df.loc[:, 'x'].str.split('|', expand=True).values.ravel()) - set([None])

answered Mar 25, 2019 at 5:34

bubble


2 Comments

Sociopath Over a year ago

Why does it return None? Can you please explain?

bubble Over a year ago

The items in column x have different numbers of colors, so when they are expanded into columns, the shorter rows are padded: some cells of df.loc[:, 'x'].str.split('|', expand=True) become None. You need to exclude None from the result.
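The padding described in this comment can be seen directly on the sample data. This is a small sketch, not part of the original answer; wide, n_missing and colors are illustrative names, and the filter uses pd.notna so it works whether the padding shows up as None or NaN.

```python
import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

# expand=True returns a DataFrame with one column per split position.
wide = df['x'].str.split('|', expand=True)

# The longest row has 4 colors, so every row gets 4 columns; the two
# 3-color rows have a missing value in the last column.
print(wide.shape)  # (3, 4)
n_missing = int(wide.isna().sum().sum())
print(n_missing)   # 2

# Filter out the missing cells while building the set.
colors = {c for c in wide.to_numpy().ravel() if pd.notna(c)}
print(sorted(colors))
```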

iamklaus · Accepted Answer · 2019-03-25 05:35:52Z

1

list(df.x.str.split('|', expand=True).stack().reset_index(name='x').drop_duplicates('x')['x'])

Output

['RED', 'BROWN', 'YELLOW', 'WHITE', 'BLACK', 'GREEN', 'BLUE', 'PINK']
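As a possible alternative (not what this answer uses), newer pandas versions have Series.explode, which was added in pandas 0.25, after this answer was written. Assuming that version or later, the following sketch gives the same list in order of first appearance, without expand=True and therefore without any padding to clean up. The name unique_colors is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

# Split into lists, turn each list element into its own row,
# then keep the first occurrence of each color.
unique_colors = df['x'].str.split('|').explode().unique().tolist()

print(unique_colors)
# ['RED', 'BROWN', 'YELLOW', 'WHITE', 'BLACK', 'GREEN', 'BLUE', 'PINK']
```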

answered Mar 25, 2019 at 5:35

iamklaus


Nambi_0915 · Accepted Answer · 2019-03-25 05:58:11Z

1

You can also do set(df['x'].str.split('|').values.sum())

This also keeps None out of the output, since without expand=True the split lists are not padded:

{'YELLOW', 'RED', 'WHITE', 'BROWN', 'GREEN', 'PINK', 'BLUE', 'BLACK'}

answered Mar 25, 2019 at 5:58

Nambi_0915
