I am trying to generate a new column on my existing dataframe that is built off conditional statements with the input being data from multiple columns in the dataframe.

I'm using np.select() because I read that it is the best way to use multiple columns as inputs to tiered conditions. However, when I run the code, the default value is populated even in rows where the criteria are met. Below is some example code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,2, size=(20,3)), columns = list('ABC'))

choices = ['C Highest','B Highest','A Highest']
conditions = [
        (df['C'] is True), 
        (df['C'] is False & df['B'] is True),
        (df['A'] is True & df['C']is False & df['B'] is False)]

#conditions = [
#        (df['C'] == 1), 
#        (df['C'] == 0 & df['B'] == 1),
#        (df['A'] == 1 & df['C'] == 0 & df['B'] == 0)]

df['Highest Column'] = np.select(conditions, choices, default=np.nan)

When I run the above code, I get no errors, but the Highest Column in the dataframe is all NaN. It's as if the code works, but none of the conditions seem to be met (despite them being true) so only the default value is populated.

When I switch the conditions to the one that's commented out (and then comment out the previous conditions variable), I get "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

Obviously this data is just random and abstracted from my use case, but the underlying code should be nearly identical. If there is a 1 in column C, the row should be marked as "C Highest" in the Highest Column series of the DataFrame. If column C is 0 but B has a 1, then it should be "B Highest", and so on.

I know I can do this in excel really quickly, but I'd much rather learn how to do this in Python/pandas, so any advice is much appreciated!

  • You forgot the brackets in your commented-out conditions: (df['C'] == 0) & (df['B'] == 1). Commented Aug 9, 2019 at 22:13

1 Answer

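Wrap each comparison in its own parentheses. The & operator binds more tightly than ==, so df['C'] == 0 & df['B'] == 1 is parsed as a chained comparison against 0 & df['B'], and that is what raises the ValueError. The version using is fails silently because is tests object identity: df['C'] is True is a single Python False, not a row-by-row mask, so every row falls through to the default. Try: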

choices = ['C Highest','B Highest','A Highest']
conditions = [
       (df['C'] == 1), 
       ((df['C'] == 0) & (df['B'] == 1)),
       ((df['A'] == 1) & (df['C'] == 0) & (df['B'] == 0))]

df['Highest Column'] = np.select(conditions, choices, default=np.nan)
# df.head()

    A   B   C   Highest Column
0   1   0   0   A Highest
1   0   0   1   C Highest
2   1   1   0   B Highest
3   1   0   1   C Highest
4   1   1   0   B Highest
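
For anyone who wants to reproduce this without random data, here is a minimal sketch on a small fixed frame. The four rows and the "None Highest" placeholder label are made up for illustration and are not from the question. A string default is used instead of np.nan because mixing string choices with a float default may raise a dtype error on recent NumPy releases; that swap is my assumption, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small fixed frame so the result is reproducible (illustrative data only).
df = pd.DataFrame({"A": [1, 0, 1, 0], "B": [0, 0, 1, 0], "C": [0, 1, 0, 0]})

# `is` compares object identity, so this is one plain Python bool, not a mask.
identity_check = df["C"] is True

# Each comparison is parenthesized before combining with &, giving
# element-wise boolean Series that np.select can evaluate in order.
conditions = [
    (df["C"] == 1),
    (df["C"] == 0) & (df["B"] == 1),
    (df["A"] == 1) & (df["C"] == 0) & (df["B"] == 0),
]
choices = ["C Highest", "B Highest", "A Highest"]

# "None Highest" is a hypothetical placeholder for rows matching no condition.
df["Highest Column"] = np.select(conditions, choices, default="None Highest")
print(df)
```

np.select takes the first condition that is true for each row, so the order of the list matters: a row with a 1 in C is labelled "C Highest" regardless of A and B.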