How to avoid NaN when using np.where function in python?

Question

I have a dataframe like this,

col1    col2   col3
1       apple   a,b 
2       car      c
3       dog     a,c
4       dog     NaN

I tried to create three new columns, a, b and c, each of which should be 1 if col3 contains that letter and 0 otherwise.

df['a']= np.where(df['col3'].str.contains('a'),1,0)
df['b']= np.where(df['col3'].str.contains('b'),1,0)
df['c']= np.where(df['col3'].str.contains('c'),1,0)

But it seems NaN values were not handled correctly. It gives me a result like,

col1  col2  col3    a   b   c
1    apple   a,b    1   1   0
2     car     c     0   0   1
3     dog    a,c    1   0   1
4     dog    NaN    1   1   1

It should be all '0's in the 4th row. How can I change my code to get the right answer?

Why not drop NaN before using np.where, e.g. df = df.dropna()?
– PapaDiHatti, Commented Sep 16, 2019 at 19:34
@Kapil that's one possibility, but it seems OP wants to keep the frame structure and append the parsed columns back, which wouldn't work if a dropna was done first.
– r.ook, Commented Sep 16, 2019 at 19:36
You clearly need get_dummies, but for the sake of your question: NaNs are truthy, so don't trust numpy's judgement on that. Explicitly fill them at the end to avoid ambiguity: df.col3.str.contains('a').fillna(False)
– rafaelc, Commented Sep 16, 2019 at 19:44
The reason NaNs are True can be found in the docs: only a very limited number of objects are cast to False, and everything else is True.
– rafaelc, Commented Sep 16, 2019 at 19:46
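
To make the comments above concrete, here is a minimal sketch of why the NaN row ends up as 1. The Series is rebuilt from the question's col3, and dtype=object is an assumption made so that str.contains returns NaN for the missing entry; newer pandas string dtypes may handle missing values differently.

```python
import numpy as np
import pandas as pd

# col3 from the question; dtype=object is assumed so that missing values
# propagate as NaN through str.contains.
col3 = pd.Series(["a,b", "c", "a,c", np.nan], dtype=object)

# str.contains leaves the missing entry as NaN instead of True/False.
mask = col3.str.contains("a")
print(mask.tolist())

# NaN is a non-zero float, so Python treats it as truthy.
print(bool(np.nan))  # True

# np.where converts the condition to booleans, so the NaN row is taken as True.
naive = np.where(mask, 1, 0)
print(naive.tolist())  # [1, 0, 1, 1]
```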

BENY · Accepted Answer · 2019-09-16 19:39:41Z

4

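What I would do:

# Split col3 on ',' and build one 0/1 indicator column per label.
# Rows where col3 is NaN come out as all zeros.
s = df.col3.str.get_dummies(sep=',')
s
Out[29]: 
   a  b  c
0  1  1  0
1  0  0  1
2  1  0  1
3  0  0  0
# Append the indicator columns to the original frame.
df = pd.concat([df, s], axis=1)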
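
One caveat with this approach: get_dummies only creates columns for labels that actually occur in the data. The sketch below rebuilds the question's frame and uses reindex to force a fixed set of columns. The list named wanted and the extra label "d" are hypothetical, added only to show a label that never appears.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the sample data in the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["apple", "car", "dog", "dog"],
    "col3": ["a,b", "c", "a,c", np.nan],
})

# Hypothetical list of expected labels; "d" never occurs in col3.
wanted = ["a", "b", "c", "d"]

dummies = (
    df["col3"]
    .str.get_dummies(sep=",")               # one 0/1 column per label found
    .reindex(columns=wanted, fill_value=0)  # add missing labels as all-zero columns
)

# The indicator columns share the original index, so a plain join lines them up.
out = df.join(dummies)
print(out)
```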

Both this and user3483203's comment are great, but if the column is not delimited and a str.contains is actually required, then it wouldn't work :(
– r.ook, Commented over a year ago

ansev · Accepted Answer · 2019-09-16 19:56:46Z

1

You can use fillna(False). np.where takes a Boolean condition, so once NaN has been replaced with False, the rows where col3 is NaN will always get 0.

df['a']= np.where(df['col3'].str.contains('a').fillna(False),1,0)
df['b']= np.where(df['col3'].str.contains('b').fillna(False),1,0)
df['c']= np.where(df['col3'].str.contains('c').fillna(False),1,0)

Output:

   col1   col2 col3  a  b  c
0     1  apple  a,b  1  1  0
1     2    car    c  0  0  1
2     3    dog  a,c  1  0  1
3     4    dog  NaN  0  0  0
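
As an alternative to chaining fillna(False), str.contains accepts an na argument that sets the result for missing values directly. The sketch below assumes the frame from the question and loops over the three letters. It also passes regex=False, which is a choice made here (not part of the original answer) so that each letter is matched literally.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the sample data in the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["apple", "car", "dog", "dog"],
    "col3": ["a,b", "c", "a,c", np.nan],
})

for letter in ["a", "b", "c"]:
    # na=False: rows where col3 is missing count as "no match".
    # regex=False: treat the letter as a literal substring, not a pattern.
    df[letter] = df["col3"].str.contains(letter, na=False, regex=False).astype(int)

print(df)
```

Because this relies on substring matching rather than splitting on a delimiter, it should also cover the non-delimited case raised in the comments, at the cost of matching the letter anywhere in the string.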
