Using np.select to generate conditional column based off data from multiple other columns

Question

I am trying to generate a new column on my existing dataframe that is built off conditional statements with the input being data from multiple columns in the dataframe.

I'm using np.select() because I read that it is the best way to use multiple columns as inputs to tiered conditions. However, when I run the code, only the default value is populated, even for rows that meet one of the conditions. Below is some example code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,2, size=(20,3)), columns = list('ABC'))

choices = ['C Highest','B Highest','A Highest']
conditions = [
        (df['C'] is True), 
        (df['C'] is False & df['B'] is True),
        (df['A'] is True & df['C']is False & df['B'] is False)]

#conditions = [
#        (df['C'] == 1), 
#        (df['C'] == 0 & df['B'] == 1),
#        (df['A'] == 1 & df['C'] == 0 & df['B'] == 0)]

df['Highest Column'] = np.select(conditions, choices, default=np.nan)

When I run the above code, I get no errors, but the Highest Column in the dataframe is all NaN. It's as if the code works, but none of the conditions seem to be met (despite them being true) so only the default value is populated.

When I switch the conditions to the one that's commented out (and then comment out the previous conditions variable), I get "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

Obviously this data is just random and abstracted from my use case, but the underlying code should be nearly identical. If there is a 1 in Column C, it should be marked as Column C in the Highest Column Series in the Dataframe. If Column C is 0, but B has a 1, then Highest should be Column B. etc etc.

I know I can do this in excel really quickly, but I'd much rather learn how to do this in Python/pandas, so any advice is much appreciated!

You forgot the brackets in your commented-out conditions: (df['C'] == 0) & (df['B'] == 1)
– Erfan, Commented Aug 9, 2019 at 22:13
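
To illustrate the comment above: the ValueError comes from operator precedence, because & binds more tightly than == in Python. The following is a minimal sketch on a small made-up frame (the values are an assumption, not the question's random data) showing the failing expression next to the parenthesized one.

```python
import pandas as pd

# Small hypothetical frame standing in for the question's random data.
df = pd.DataFrame({"B": [1, 0, 1], "C": [0, 0, 1]})

# Without parentheses, & binds tighter than ==, so
#   df['C'] == 0 & df['B'] == 1
# is parsed as the chained comparison
#   (df['C'] == (0 & df['B'])) and ((0 & df['B']) == 1)
# and the implicit `and` asks for the truth value of a whole Series.
try:
    df["C"] == 0 & df["B"] == 1
    raised = False
except ValueError:
    raised = True

# With parentheses, each comparison produces a boolean Series first,
# and & then combines them element by element.
mask = (df["C"] == 0) & (df["B"] == 1)

print(raised)
print(mask.tolist())
```

Only the first row has C equal to 0 and B equal to 1, so only the first element of the mask is True.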

Hryhorii Pavlenko · Accepted Answer · 2019-08-09 22:08:15Z

Compare with == instead of is, and wrap each comparison in parentheses before combining them with &:

choices = ['C Highest','B Highest','A Highest']
conditions = [
       (df['C'] == 1), 
       ((df['C'] == 0) & (df['B'] == 1)),
       ((df['A'] == 1) & (df['C'] == 0) & (df['B'] == 0))]

df['Highest Column'] = np.select(conditions, choices, default=np.nan)

# df.head()

    A   B   C   Highest Column
0   1   0   0   A Highest
1   0   0   1   C Highest
2   1   1   0   B Highest
3   1   0   1   C Highest
4   1   1   0   B Highest
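
As for why the original `is True` version silently produced only the default: `is` tests object identity, so `df['C'] is True` evaluates to a single Python `False` rather than a per-row mask, and np.select never sees a matching row. Here is a self-contained sketch on fixed, made-up data (an assumption, so the result is reproducible). It also swaps the np.nan default for a string, which is my own choice to keep the result a single dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the result is reproducible.
df = pd.DataFrame({"A": [1, 0, 1, 0], "B": [0, 0, 1, 0], "C": [0, 1, 0, 0]})

# `is` compares object identity: a Series is never the singleton True,
# so this is one plain Python bool, not a row-by-row mask.
identity_check = df["C"] is True

# Element-wise comparisons, each wrapped in parentheses before &.
conditions = [
    df["C"] == 1,
    (df["C"] == 0) & (df["B"] == 1),
    (df["A"] == 1) & (df["C"] == 0) & (df["B"] == 0),
]
choices = ["C Highest", "B Highest", "A Highest"]

# String default (an assumption, replacing np.nan) so that the choices
# and the default share one dtype.
df["Highest Column"] = np.select(conditions, choices, default="None")

print(identity_check)
print(df)
```

Since np.select uses the first condition that is true for each row, the `== 0` checks in the later conditions are redundant here; `df['B'] == 1` and `df['A'] == 1` alone would give the same result, though spelling them out does no harm.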

