I want to add a new column in a dataframe with the names of other columns as values, based on a condition.

import pandas as pd

data = pd.DataFrame({
    'customer': ['bob', 'jerry', 'alice', 'susan'],
    'internet_bill': ['paid', 'past_due', 'due_soon', 'past_due'],
    'electric_bill': ['past_due', 'due_soon', 'past_due', 'paid'],
    'water_bill': ['paid', 'past_due', 'paid', 'paid']})

Here's the dataframe.

    customer    internet_bill   electric_bill   water_bill
0   bob         paid            past_due        paid
1   jerry       past_due        due_soon        past_due
2   alice       due_soon        past_due        paid
3   susan       past_due        paid            paid

I want to add a new column summarizing what is 'past_due'. Here's the desired result:

    customer    internet_bill   electric_bill   water_bill  past_due
0   bob         paid            past_due        paid        electric_bill
1   jerry       past_due        due_soon        past_due    internet_bill, water_bill
2   alice       due_soon        past_due        paid        electric_bill
3   susan       past_due        paid            paid        internet_bill

I was able to do this in Excel with the following formula:

=TEXTJOIN(","&CHAR(10),TRUE,
IF(B2=Values!$A$1,$K$1,""),
IF(C2=Values!$A$1,$L$1,""),
IF(D2=Values!$A$1,$M$1,""))

Ultimately, my output will be an Excel file for nurses and hospital workers to follow up with patients (not bill collecting; this is patient care). I have thought about using an Excel writer library to create an .xlsx file and insert the formulas directly.

I was also able to do this for a single column, but my gut tells me there's a much better way. Here's what I used:

data['past_due'] = [
    'internet_bill' if x == 'past_due'
    else 'None' for x in data['internet_bill']]

Ideally, the code would check each targeted column in a row for 'past_due' and, where it finds a match, add that column's name to the result before moving on to the next column.

I have had no success in finding anything close to this with searches, probably due to struggling to form a good question in the search bar. I haven't found any questions where someone is trying to pull other column names as a value based on a condition.

Thanks for any help!

1 Answer

  >>> data['past_due'] = data.apply(
  ...     lambda x: tuple(x[x == 'past_due'].index), axis=1)
  >>> data
    customer             ...                                  past_due
  0      bob             ...                          (electric_bill,)
  1    jerry             ...               (internet_bill, water_bill)
  2    alice             ...                          (electric_bill,)
  3    susan             ...                          (internet_bill,)
  [4 rows x 5 columns]
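
If the goal is the comma-separated string shown in the question rather than a tuple, one possible variation is to join the matching column names. This is only a sketch under a couple of assumptions: `bill_cols` is a name introduced here for illustration, and restricting the check to those columns is a choice made so that the `customer` column is never compared against 'past_due'.

```python
import pandas as pd

data = pd.DataFrame({
    'customer': ['bob', 'jerry', 'alice', 'susan'],
    'internet_bill': ['paid', 'past_due', 'due_soon', 'past_due'],
    'electric_bill': ['past_due', 'due_soon', 'past_due', 'paid'],
    'water_bill': ['paid', 'past_due', 'paid', 'paid']})

# Hypothetical helper list: only these columns are checked for 'past_due'.
bill_cols = ['internet_bill', 'electric_bill', 'water_bill']

# For each row, keep the labels whose value is 'past_due' and join them.
# Rows with no match get an empty string.
data['past_due'] = data[bill_cols].apply(
    lambda row: ', '.join(row[row == 'past_due'].index), axis=1)

print(data[['customer', 'past_due']])
```

Swapping the separator for ',\n' should give one name per line inside the cell, similar to the `CHAR(10)` in the Excel formula, although whether the line breaks display depends on the cell's wrap-text setting.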

2 Comments

This worked perfectly! Thank you! I need to learn more about lambda. This is miles better than the other approach I was taking in case I didn't get an answer. Much appreciated.
This approach works great, but it is very slow compared to similar operations like data['sum'] = data[numerical_column_names].sum(axis=1). Is there a faster variation?
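
On the speed question: `apply` with `axis=1` builds a Series for every row, which is usually where the time goes. One variation that may be faster is to compute the boolean mask for all rows at once and then index a NumPy array of column names with each mask row. This is a sketch, not a benchmarked result; `bill_cols`, `mask`, and `names` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'customer': ['bob', 'jerry', 'alice', 'susan'],
    'internet_bill': ['paid', 'past_due', 'due_soon', 'past_due'],
    'electric_bill': ['past_due', 'due_soon', 'past_due', 'paid'],
    'water_bill': ['paid', 'past_due', 'paid', 'paid']})

# Hypothetical helper list of the columns to check.
bill_cols = ['internet_bill', 'electric_bill', 'water_bill']

# One vectorized comparison for the whole frame: shape (n_rows, n_cols), dtype bool.
mask = data[bill_cols].eq('past_due').to_numpy()

# Column names as an array so each boolean row can select from it directly.
names = np.array(bill_cols)

# A plain Python loop over NumPy rows avoids creating a Series per row.
data['past_due'] = [', '.join(names[row]) for row in mask]

print(data[['customer', 'past_due']])
```

The join itself still runs once per row in Python, so this will not match a purely numeric reduction like `sum(axis=1)`, but it skips most of the per-row pandas overhead.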
