I am trying to populate a new column in a pandas dataframe based on whether another column in that row contains a string.

For example, I have a list of possible colors:

possible_colors = ['red', 'blue', 'green', 'orange', 'purple']

A dataframe contains sales data for a hypothetical product. Each product code contains a color, and I would like to create a column labeling each product with its color.

import pandas as pd

df = pd.DataFrame({'product': ['123red309','20424green098','2purple09183'],
                   'sales_qty': [20, 5, 10]})

If the product column contains the string 'green' I want to populate a new column Color with the string 'green'.

I tried doing so with the code:

for color in possible_colors:
    df['Color'] = np.where(df.product.str.contains(color),color)

This raises the error ValueError: either both or neither of x and y should be given.

My actual dataframe has thousands of rows, not just 3, and my list of possible colors has dozens of items.

How can I properly complete this task? Thank you!

2 Answers

You can use Series.str.extract():

# expand=False returns a Series rather than a one-column DataFrame
df['color'] = df['product'].str.extract(r'({})'.format('|'.join(possible_colors)), expand=False)
print(df)

         product  sales_qty   color
0      123red309         20     red
1  20424green098          5   green
2   2purple09183         10  purple

Here r'({})'.format('|'.join(possible_colors)) yields '(red|blue|green|orange|purple)', a single capture group that matches any one of the colors.
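
As a self-contained sketch of the same idea: the version below builds the pattern with re.escape, which only matters if a color name ever contains a regex metacharacter (an assumption; none of the colors in the question do). The fourth row, '555none000', is a hypothetical product code added to show that rows without a match get NaN.

```python
import re

import pandas as pd

possible_colors = ['red', 'blue', 'green', 'orange', 'purple']

# The last row is a hypothetical code with no color, added for illustration.
df = pd.DataFrame({'product': ['123red309', '20424green098', '2purple09183', '555none000'],
                   'sales_qty': [20, 5, 10, 7]})

# One capture group holding every color as an alternative.
pattern = '(' + '|'.join(re.escape(c) for c in possible_colors) + ')'

# expand=False returns a Series; rows with no match become NaN.
df['color'] = df['product'].str.extract(pattern, expand=False)
print(df)
```

Because the regex is scanned from left to right, a code containing several colors is labeled with the one that appears first in the string.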

Here is one way:

df['color'] = df['product'].apply(lambda x: ''.join(i for i in possible_colors 
                                                    if i in x) or None)

         product  sales_qty   color
0      123red309         20     red
1  20424green098          5   green
2   2purple09183         10  purple
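
One thing to watch with the one-liner above: ''.join concatenates every color that matches, so a code containing two colors (say a hypothetical '1red2blue') would be labeled 'redblue'. If a single label per row is wanted, a possible variant is sketched below. The helper name first_color is made up for this example, and "first" means first in possible_colors, not first in the product code.

```python
import pandas as pd

possible_colors = ['red', 'blue', 'green', 'orange', 'purple']

df = pd.DataFrame({'product': ['123red309', '20424green098', '2purple09183'],
                   'sales_qty': [20, 5, 10]})


def first_color(product_code, colors=possible_colors):
    """Return the first color in `colors` found in product_code, or None."""
    # next() stops at the first match instead of joining all matches.
    return next((c for c in colors if c in product_code), None)


df['color'] = df['product'].map(first_color)
print(df)
```

Like the apply version, this runs a Python-level loop per row, so on a large frame the vectorized str.extract approach is likely to be faster.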
