I have the following column in a DataFrame, which contains colors separated by |:

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

I want to find all the unique colors in the column.

Expected Output:

{'YELLOW', 'BLACK', 'RED', 'BLUE', 'BROWN', 'GREEN', 'WHITE', 'PINK'}

I don't mind whether the result is a list or a set.

What I tried:

df['x'] = df['x'].apply(lambda x: x.split("|"))

colors = []
for idx, row in df.iterrows():
    colors.extend(row['x'])

print(set(colors))

This works fine, but I am looking for a more efficient solution because I have a large dataset.

3 Comments

  • Is order important? Commented Mar 25, 2019 at 5:48
  • @anky_91 Nope, that's why I'm OK with either a list or a set. Commented Mar 25, 2019 at 5:49
  • Okay, then you can use itertools; I posted an answer with that. Commented Mar 25, 2019 at 5:50

4 Answers

Use itertools.chain.from_iterable (arguably the fastest way to flatten a list of lists) together with set:

import itertools
set(itertools.chain.from_iterable(df.x.str.split('|')))

Output:

{'BLACK', 'BLUE', 'BROWN', 'GREEN', 'PINK', 'RED', 'WHITE', 'YELLOW'}

Another possible solution uses functools.reduce, which is almost as fast as itertools:

import functools
import operator
set(functools.reduce(operator.iadd, df.x.str.split('|'), []))

Note that you can also use sum(df.x.str.split('|'), []), which is readable but slower, because every addition builds a new list.
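
As a minimal sketch of that sum() variant, using the sample frame from the question: the names parts, colors_sum and colors_chain are only illustrative, and the comparison shows that both approaches yield the same set, not how fast either one is.

```python
import itertools

import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

# Each element of `parts` is a list of color names.
parts = df['x'].str.split('|')

# sum() with an empty list as the start value concatenates the lists one by one,
# creating a new list at every step.
colors_sum = set(sum(parts, []))

# chain.from_iterable walks the same lists lazily, without intermediate copies.
colors_chain = set(itertools.chain.from_iterable(parts))

print(colors_sum == colors_chain)
```

On a large column it is worth timing both on your own data before choosing one.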

set(df.loc[:, 'x'].str.split('|', expand=True).values.ravel())

or, to drop the None that expand=True introduces as padding:

set(df.loc[:, 'x'].str.split('|', expand=True).values.ravel()) - set([None])

2 Comments

Why does it return None? Can you please explain?
The items in column x contain different numbers of colors, so when df.loc[:, 'x'].str.split('|', expand=True) expands them into columns, the shorter rows are padded with None. You need to exclude None from the result.
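
The padding described above can be seen in a small sketch on the question's sample data. The names expanded, flat and colors are illustrative, and the filter uses pd.notna on the assumption that the filler may be None or NaN depending on the pandas version and string dtype.

```python
import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

# expand=True produces one column per split position; the longest row has
# 4 colors, so the two 3-color rows each get one missing value as padding.
expanded = df['x'].str.split('|', expand=True)
print(expanded.shape)

# Flatten to a 1-D array and skip the padding while building the set.
flat = expanded.values.ravel()
colors = {c for c in flat if pd.notna(c)}
print(colors)
```

Filtering while building the set avoids the separate set subtraction.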
list(df.x.str.split('|', expand=True).stack().reset_index(name='x').drop_duplicates('x')['x'])

Output

['RED', 'BROWN', 'YELLOW', 'WHITE', 'BLACK', 'GREEN', 'BLUE', 'PINK']
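
A shorter order-preserving variant, not part of the original answer, is sketched below. It assumes pandas 0.25 or later, where Series.explode is available, and the name unique_colors is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': ['RED|BROWN|YELLOW', 'WHITE|BLACK|YELLOW|GREEN', 'BLUE|RED|PINK']})

# explode() turns each list element into its own row, so no padding is created;
# unique() then keeps the first occurrence of each color in order of appearance.
unique_colors = df['x'].str.split('|').explode().unique().tolist()
print(unique_colors)
```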


You can also do set(df['x'].str.split('|').values.sum())

Because the lists are concatenated without being expanded into columns, no None appears in the output:

{'YELLOW', 'RED', 'WHITE', 'BROWN', 'GREEN', 'PINK', 'BLUE', 'BLACK'}
