1

I'm looking to create a custom function based on several columns:

`'TOTAL_HH_INCOME', 'HH_SIZE', 'Eligible Household Size', 'income_min1', 'income_max1', 'hh_size2', 'income_min2', 'income_max2', 'hh_size3', 'income_min3', 'income_max3', 'hh_size4', 'income_min4', 'income_max4', 'hh_size5', 'income_min5', 'income_max5', 'hh_size6', 'income_min6', 'income_max6'`

For each row in my dataframe, I want to compare HH_SIZE against each hh_size# column and TOTAL_HH_INCOME against every income_min and income_max column.

I've made this function as an attempt:

def eligibility (row):
    
    if df['HH_SIZE']== df['Eligible Household Size'] & df['TOTAL_HH_INCOME'] >= df['income_min1'] & df['TOTAL_HH_INCOME'] <=row['income_max1'] :
        return 'Eligible'
    
    if df['HH_SIZE']== df['hh_size2'] & df['TOTAL_HH_INCOME'] >= df['income_min2'] & df['TOTAL_HH_INCOME'] <=row['income_max2'] :
        return 'Eligible'
    
    if df['HH_SIZE']== df['hh_size3'] & df['TOTAL_HH_INCOME'] >= df['income_min3'] & df['TOTAL_HH_INCOME'] <=row['income_max3'] :
        return 'Eligible'

    if df['HH_SIZE']== df['hh_size4'] & df['TOTAL_HH_INCOME'] >= df['income_min4'] & df['TOTAL_HH_INCOME'] <=row['income_max4'] :
        return 'Eligible'

    if df['HH_SIZE']== df['hh_size5'] & df['TOTAL_HH_INCOME'] >= df['income_min5'] & df['TOTAL_HH_INCOME'] <=row['income_max5'] :
        return 'Eligible'

    if df['HH_SIZE']== df['hh_size6'] & df['TOTAL_HH_INCOME'] >= df['income_min6'] & df['TOTAL_HH_INCOME'] <=row['income_max6'] :
        return 'Eligible'
    
    return 'Ineligible'

As you can see, if the row meets any one of the conditions I want it to be labeled 'Eligible'; if not, it should be labeled 'Ineligible'.

I applied this function to my df with

df['Eligibility']= df.apply(eligibility, axis=1)

However, I receive an error:

ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0')

Why? Is my function off the mark?

EDIT:

====================== DATAFRAME ===========================

[Image: screenshot of the dataframe, not preserved]

2
  • 1
    Can you provide a minimal reproducible sample dataframe? I believe the error is because each condition has to be enclosed in parentheses, but can't verify without a sample dataset Commented Jun 30, 2020 at 15:16
  • Hi, I added a screen grab from the underlying csv. Commented Jun 30, 2020 at 15:28

2 Answers

1

The problem seems to be the comparison operators in the if statements: because you are comparing whole columns of a data frame, each comparison does not produce a single True or False value but one boolean per item in the column, and an if statement cannot decide which of them to use.

Try using a.all() if you want to check that all of the elements are the same. Please refer to the example below:

import pandas as pd
dict1 = {'name1': ['tom', 'pedro'], 'name2': ['tom', 'pedro'],
         'name3': ['tome', 'maria'], 'name4': ['maria', 'marta']}
df1 = pd.DataFrame(dict1)

# This raises the same ValueError you are getting:
# if df1['name1'] == df1['name2']:
#     pass
# To see why, print the result of the comparison:
print('This is a Series of bool values and cannot be handled by an if statement: \n',
      df1['name1'] == df1['name2'])

# This checks whether all the elements in 'name1' are the same as in 'name2'
if (df1['name1'] == df1['name2']).all():
    print('\nEligible')

Output:

This is a Series of bool values and cannot be handled by an if statement: 
 0    True
 1    True
dtype: bool

Eligible
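
For the row-by-row check in the question, .all() is probably not what you want, since it collapses the whole column into one answer. A minimal sketch of one alternative, assuming the function is meant to run once per row through df.apply(..., axis=1): index the row argument instead of df, so every comparison involves scalars and yields a single bool, and combine the conditions with Python's `and` (with `&`, each comparison would need its own parentheses because `&` binds more tightly than `==` and `>=`). The sample data and the TIERS list are hypothetical, and only two of the six tiers are shown.

```python
import pandas as pd

# Hypothetical sample data; only two of the six tiers are included for brevity.
df = pd.DataFrame({
    'HH_SIZE': [1, 2, 3],
    'TOTAL_HH_INCOME': [15000, 90000, 40000],
    'Eligible Household Size': [1, 1, 1],
    'income_min1': [10000, 10000, 10000],
    'income_max1': [20000, 20000, 20000],
    'hh_size2': [2, 2, 2],
    'income_min2': [20000, 20000, 20000],
    'income_max2': [30000, 30000, 30000],
})

# (size column, min column, max column) for each tier present in the frame.
TIERS = [
    ('Eligible Household Size', 'income_min1', 'income_max1'),
    ('hh_size2', 'income_min2', 'income_max2'),
]

def eligibility(row):
    # `row` holds the scalar values of one row, so each comparison is a single bool.
    for size_col, min_col, max_col in TIERS:
        if (row['HH_SIZE'] == row[size_col]
                and row[min_col] <= row['TOTAL_HH_INCOME'] <= row[max_col]):
            return 'Eligible'
    return 'Ineligible'

df['Eligibility'] = df.apply(eligibility, axis=1)
print(df[['HH_SIZE', 'TOTAL_HH_INCOME', 'Eligibility']])
```

Extending TIERS with the remaining four column triples would cover all six household sizes without repeating the if block.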

1 Comment

How would this work in my instance? I get that I want matching values for household sizes. But after that I also want to see whether that same household meets the income criteria, by comparing its household income against the income criteria range (income min and max).
0

You could try this, using df.to_records():

import re

#df.columns
s=['TOTAL_HH_INCOME','HH_SIZE','Eligible Household Size', 'income_min1', 'income_max1', 'hh_size2','income_min2', 'income_max2', 'hh_size3', 'income_min3', 'income_max3', 'hh_size4', 'income_min4', 'income_max4', 'hh_size5', 'income_min5', 'income_max5', 'hh_size6', 'income_min6', 'income_max6']


def func(row):
    totalincome=row[2]
    HHSIZE=row[3]
    indexhhsize=list(map(s.index,re.findall('(hh_size\d+)',''.join(s))))
    indexmax=list(map(s.index,re.findall('(income_max\d+)',''.join(s))))
    indexmin=list(map(s.index,re.findall('(income_min\d+)',''.join(s))))

    if(any(HHSIZE==row[i+1] for i in indexhhsize))\
    |(any(totalincome>=row[i+1] for i in indexmin))\
    |(any(totalincome<=row[i+1] for i in indexmax)):
        return 'Eligible'
    else:
        return 'Ineligible'
    
df['Eligibility']=[func(row) for row in df.to_records()]
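
If the columns contain strings or missing values, as a TypeError mentioning 'str' would suggest, a vectorized variant may be easier to debug than looping over records. The following is only a sketch under assumptions: the sample data and the tiers list are hypothetical, every column involved is coerced with pd.to_numeric so that unparseable entries become NaN, and a row counts as eligible only when the size, minimum and maximum of the same tier all match. Comparisons against NaN are False, so rows with missing values end up 'Ineligible'.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings and a missing value,
# similar to what a CSV import can produce.
df = pd.DataFrame({
    'HH_SIZE': ['1', '2', '2'],
    'TOTAL_HH_INCOME': ['15000', '25000', None],
    'Eligible Household Size': [1, 1, 1],
    'income_min1': [10000, 10000, 10000],
    'income_max1': [20000, 20000, 20000],
    'hh_size2': [2, 2, 2],
    'income_min2': ['20000', '20000', '20000'],
    'income_max2': [30000, 30000, 30000],
})

# (size column, min column, max column) per tier; extend for tiers 3 to 6.
tiers = [
    ('Eligible Household Size', 'income_min1', 'income_max1'),
    ('hh_size2', 'income_min2', 'income_max2'),
]

# Coerce every column involved to numeric; unparseable entries become NaN.
cols = ['HH_SIZE', 'TOTAL_HH_INCOME'] + [c for tier in tiers for c in tier]
num = df[cols].apply(pd.to_numeric, errors='coerce')

# Start with nobody eligible, then OR in each tier's combined condition.
eligible = pd.Series(False, index=df.index)
for size_col, min_col, max_col in tiers:
    eligible |= (
        (num['HH_SIZE'] == num[size_col])
        & (num['TOTAL_HH_INCOME'] >= num[min_col])
        & (num['TOTAL_HH_INCOME'] <= num[max_col])
    )

df['Eligibility'] = eligible.map({True: 'Eligible', False: 'Ineligible'})
print(df[['HH_SIZE', 'TOTAL_HH_INCOME', 'Eligibility']])
```

Note that the three conditions of a tier are joined with `&` inside parentheses, and the tiers themselves are joined with `|`, so a household is not marked eligible just because its size matches one tier and its income falls in the range of another.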
        

6 Comments

Ran it but: TypeError: '<=' not supported between instances of 'numpy.ndarray' and 'str'
Could you add a sample of your dataframe so I can test it, please?
Hi, I'm having trouble adding the df as text, but I included an image from Excel of df.head(10) as an example.
Oh I see, you have multiple NaN values... maybe you can first try: df.fillna(0)
Thanks for the help mate. Gave it a go but same error re: TypeError: '<=' not supported between instances of 'numpy.ndarray' and 'str'