Create new categorical variable based on multiple binary columns

Question

I have a data frame with many binary variables, and I would like to create a new categorical variable based on several of them.

My dataframe looks like this:

gov_winner    corp_winner    in part
        1              0           0
        0              1           0
        0              0           1

The variable I would like to create is called winning_party and would look like this:

gov_winner    corp_winner    in part    winning_party
        1              0           0             gov
        0              1           0            corp
        0              0           1         in part

I started trying the following code but haven't had success yet:

 harrington_citations = harrington_citations.assign(winning_party=lambda x: x['gov_winner'] 
 == 1 then x = 'gov' else x == 0)

Using anky_91's answer I get the following error:

TypeError: can't multiply sequence by non-int of type 'str'

Are the columns filled only with 1 and 0?

– jezrael, Commented Jan 13, 2020 at 15:00
Does df @ df.columns work?

– Quang Hoang, Commented Jan 13, 2020 at 15:08

anky · Accepted Answer · 2020-01-13 14:57:22Z

3

You can use a dot product:

df.assign(Winner_Party=df.dot(df.columns))
#df.assign(Winner_Party=df @ df.columns)

   gov_winner  corp_winner  in_part Winner_Party
0           1            0        0   gov_winner
1           0            1        0  corp_winner
2           0            0        1      in_part
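A minimal sketch of why the dot product produces a name, and of one likely cause of the TypeError reported in the question. Multiplying an integer by a string repeats the string, so 1 * "gov_winner" is "gov_winner" and 0 * "gov_winner" is an empty string; summing across a row concatenates the pieces. If the frame also holds text columns, pandas ends up multiplying a string by a string, which raises the error. The case_name column and the indicator_cols list below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame: three indicator columns plus an unrelated text column.
df = pd.DataFrame({
    "case_name": ["a", "b", "c"],
    "gov_winner": [1, 0, 0],
    "corp_winner": [0, 1, 0],
    "in part": [0, 0, 1],
})

# Restrict the dot product to the indicator columns (assumed list).
indicator_cols = ["gov_winner", "corp_winner", "in part"]
indicators = df[indicator_cols].astype(int)

# Each row's dot product with the column names concatenates the names
# of the columns flagged with 1 and contributes "" for the zeros.
df["winning_party"] = indicators.dot(indicators.columns)
print(df)
```

Calling df.dot(df.columns) on the full frame here would try to evaluate "a" * "case_name", which is where the "can't multiply sequence by non-int of type 'str'" message comes from.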

edited Jan 13, 2020 at 14:57

answered Jan 13, 2020 at 14:54

anky


3 Comments

Graham Streich Over a year ago

I updated my question with the error I got. A problem may be that the actual data frame I am working with has many variables not involved in the new variable I am trying to create. Thanks.

Graham Streich Over a year ago

I can make a df with just the variables I am using to create this new variable and see if your answer works...

anky Over a year ago

@GrahamStreich Maybe you have columns that don't contain only 1 and 0. Filter such columns out and try again.

BENY · Accepted Answer · 2020-01-13 15:00:21Z

3

How about idxmax? Note that this only selects the first maximum, so if you have multiple cells equal to 1 per row, you may want to try jezrael's solution.

df['Winner_Party']=df.eq(1).idxmax(1)
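The question asks for short labels (gov, corp, in part) rather than the column names, so one possible extension is to map the idxmax result through a dictionary. This is only a sketch: the labels mapping is an assumption based on the desired output in the question, and the final conversion to a categorical dtype is optional.

```python
import pandas as pd

df = pd.DataFrame({
    "gov_winner": [1, 0, 0],
    "corp_winner": [0, 1, 0],
    "in part": [0, 0, 1],
})

# Assumed mapping from column names to the labels shown in the question.
labels = {"gov_winner": "gov", "corp_winner": "corp", "in part": "in part"}

flags = df[list(labels)].eq(1)
winner = flags.idxmax(axis=1).map(labels)

# idxmax returns the first column even for a row with no 1 at all,
# so leave such rows missing instead of mislabeling them.
df["winning_party"] = winner.where(flags.any(axis=1)).astype("category")
print(df)
```

Selecting the columns through list(labels) also keeps unrelated columns out of the comparison.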

answered Jan 13, 2020 at 15:00

BENY


jezrael · Accepted Answer · 2020-01-13 15:02:09Z

1

If there is always exactly one 1 per row, use DataFrame.dot. You can also keep only the columns that contain nothing but 0 and 1 beforehand:

df1 = df.loc[:, df.isin([0,1,'0','1']).all()].astype(int)
df['Winner_Party'] = df1.dot(df1.columns)

But if there can be multiple 1s per row and you need all matched values, add a separator to the column names and then strip the trailing one:

df['Winner_Party'] = df1.dot(df1.columns + ',').str.rstrip(',')

print (df)
   gov_winner  corp_winner  in part Winner_Party
0           1            0        0   gov_winner
1           0            1        0  corp_winner
2           0            0        1      in part
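The sample data never has more than one 1 per row, so the separator variant looks identical to the plain dot product there. A small sketch with made-up rows (one with two flags set, one with none) shows where the two differ; the data below is hypothetical.

```python
import pandas as pd

# Hypothetical rows: the second has two flags set, the third has none.
df = pd.DataFrame({
    "gov_winner": [1, 1, 0],
    "corp_winner": [0, 1, 0],
    "in part": [0, 0, 0],
})

# Appending "," to every name keeps concatenated names apart;
# rstrip then drops the trailing separator.
joined = df.dot(df.columns + ",").str.rstrip(",")
print(joined.tolist())
```

Without the separator the second row would come out as the two names run together, and a row with no flags yields an empty string in either variant.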

edited Jan 13, 2020 at 15:02

answered Jan 13, 2020 at 14:56

jezrael
