I am trying to generate a new column on my existing dataframe that is built off conditional statements with the input being data from multiple columns in the dataframe.
I'm using np.select() because I read it is the best way to use multiple columns as inputs to several levels of conditions. However, when I run the code, every row gets the default value, even when the criteria for a row are met. Below is some example code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,2, size=(20,3)), columns = list('ABC'))
choices = ['C Highest','B Highest','A Highest']
conditions = [
(df['C'] is True),
(df['C'] is False & df['B'] is True),
(df['A'] is True & df['C']is False & df['B'] is False)]
#conditions = [
# (df['C'] == 1),
# (df['C'] == 0 & df['B'] == 1),
# (df['A'] == 1 & df['C'] == 0 & df['B'] == 0)]
df['Highest Column'] = np.select(conditions, choices, default=np.nan)
When I run the above code, I get no errors, but the Highest Column in the dataframe is all NaN. It's as if the code works, but none of the conditions seem to be met (despite them being true) so only the default value is populated.
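For what it's worth, a likely reason for the all-default result is that `is` compares object identity rather than values, so `df['C'] is True` asks whether the Series object itself is the constant True. That gives a single Python bool, not a row-by-row mask. A small sketch, assuming a hypothetical three-row Series named s:

```python
import pandas as pd

# Hypothetical stand-in for one column of the dataframe.
s = pd.Series([1, 0, 1])

# `is` checks object identity: a Series is never the same object as True,
# so this is one plain Python bool rather than an element-wise mask.
identity_check = s is True
print(identity_check)        # False
print(type(identity_check))  # <class 'bool'>

# `==` is element-wise and returns a boolean Series usable as a condition.
mask = s == 1
print(mask.tolist())         # [True, False, True]
```

With every condition collapsing to False, np.select has nothing to match and falls back to the default for each row.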
When I switch to the commented-out conditions instead (and comment out the first conditions variable), I get "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
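This ValueError is consistent with Python's operator precedence: `&` binds more tightly than `==`, so the unparenthesized expression becomes a chained comparison that needs the truth value of a whole Series. A minimal sketch that reproduces it, assuming a small hand-written dataframe (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical three-row frame with just the two columns involved.
df = pd.DataFrame({"B": [1, 0, 1], "C": [0, 0, 1]})

# Parsed as: df["C"] == (0 & df["B"]) == 1
# The chained comparison implies an `and`, which calls bool() on a Series.
try:
    df["C"] == 0 & df["B"] == 1
    raised = False
except ValueError:
    raised = True

# Parenthesizing each comparison keeps the whole expression element-wise.
mask = (df["C"] == 0) & (df["B"] == 1)

print(raised)         # True
print(mask.tolist())  # [True, False, False]
```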
Obviously this data is just random and abstracted from my use case, but the underlying code should be nearly identical. If there is a 1 in Column C, it should be marked as Column C in the Highest Column Series in the Dataframe. If Column C is 0, but B has a 1, then Highest should be Column B. etc etc.
I know I can do this in excel really quickly, but I'd much rather learn how to do this in Python/pandas, so any advice is much appreciated!
Each comparison needs its own parentheses before it is combined with &, for example:
(df['C'] == 0) & (df['B'] == 1),
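Putting that together, here is a sketch of the full np.select call with every comparison parenthesized. Two things here are my own assumptions rather than part of the original code: the seeded generator (only there to make the run repeatable) and the string default "No 1s", used because mixing string choices with a NaN default can raise a dtype error on recent NumPy versions.

```python
import numpy as np
import pandas as pd

# Seeded generator is an assumption, used only for a repeatable example.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(20, 3)), columns=list("ABC"))

choices = ["C Highest", "B Highest", "A Highest"]

# Each comparison is wrapped in parentheses so `&` combines boolean Series.
conditions = [
    (df["C"] == 1),
    (df["C"] == 0) & (df["B"] == 1),
    (df["A"] == 1) & (df["C"] == 0) & (df["B"] == 0),
]

# "No 1s" is a hypothetical placeholder label for rows where all three are 0.
df["Highest Column"] = np.select(conditions, choices, default="No 1s")

print(df.head())
```

np.select takes the first matching condition for each row, so the list order already encodes the priority C, then B, then A.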