I have a dataframe like this,

col1    col2   col3
1       apple   a,b 
2       car      c
3       dog     a,c
4       dog     NaN

I tried to create three new columns, a, b and c, each of which should be 1 if col3 contains the corresponding string and 0 otherwise.

df['a']= np.where(df['col3'].str.contains('a'),1,0)
df['b']= np.where(df['col3'].str.contains('b'),1,0)
df['c']= np.where(df['col3'].str.contains('c'),1,0)

But it seems NaN values were not handled correctly. It gives me a result like,

col1  col2  col3    a   b   c
1    apple   a,b    1   1   0
2     car     c     0   0   1
3     dog    a,c    1   0   1
4     dog    NaN    1   1   1

It should be all '0's in the 4th row. How can I change my code to get the right answer?

  • Why not drop the NaN rows before using np.where, e.g. df = df.dropna()? Commented Sep 16, 2019 at 19:34
  • @Kapil that's one possibility, but it seems OP wants to keep the frame's structure and append the parsed columns back, which wouldn't work if a dropna was done first. Commented Sep 16, 2019 at 19:36
  • Use df.join(df['col3'].str.get_dummies(',')) Commented Sep 16, 2019 at 19:37
  • You clearly need get_dummies, but for the sake of your question: NaN is truthy, so np.where treats it as True. Fill the missing values explicitly to avoid the ambiguity: df.col3.str.contains('a').fillna(False) Commented Sep 16, 2019 at 19:44
  • The reason NaN is truthy can be found in the docs: only a small set of objects are cast to False, and everything else is True. Commented Sep 16, 2019 at 19:46
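The truthiness point from the comments above can be checked with a small sketch. The mask here is built by hand (an assumption, standing in for what str.contains returns on an object column with a missing value), and the variable names are only illustrative:

```python
import numpy as np

# Hand-built stand-in for an object-dtype mask with one missing value,
# similar to what str.contains yields when a cell is NaN.
mask = np.array([True, False, True, np.nan], dtype=object)

# NaN is not one of Python's falsy values, so it counts as True.
nan_is_truthy = bool(float("nan"))

# np.where casts the mask to bool, so the NaN slot is filled with 1.
flags = np.where(mask, 1, 0)
print(nan_is_truthy, flags)
```

This is why the fourth row in the question ends up with 1 in every new column.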

2 Answers

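What I would do: split col3 on the comma with get_dummies, then concatenate the result back onto the frame. A NaN row simply gets 0 in every column.

s = df['col3'].str.get_dummies(sep=',')
s
Out[29]: 
   a  b  c
0  1  1  0
1  0  0  1
2  1  0  1
3  0  0  0
df = pd.concat([df, s], axis=1)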
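
For reference, a self-contained sketch of the get_dummies approach. The frame is rebuilt from the table in the question, and the names dummies and result are only illustrative:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["apple", "car", "dog", "dog"],
    "col3": ["a,b", "c", "a,c", np.nan],
})

# One indicator column per comma-separated label; the NaN row gets all zeros.
dummies = df["col3"].str.get_dummies(sep=",")

# Append the indicator columns without dropping any rows.
result = df.join(dummies)
print(result)
```

Using join here instead of pd.concat is equivalent for this case, since both frames share the same index.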

1 Comment

Both this and user3483203's comment are great, but if the column is not delimited and actually a str.contains is required then it wouldn't work :(

You can use fillna(False). np.where then receives a purely Boolean mask, so the rows where col3 is NaN always get 0.

df['a'] = np.where(df['col3'].str.contains('a').fillna(False), 1, 0)
df['b'] = np.where(df['col3'].str.contains('b').fillna(False), 1, 0)
df['c'] = np.where(df['col3'].str.contains('c').fillna(False), 1, 0)

Output:

   col1   col2 col3  a  b  c
0     1  apple  a,b  1  1  0
1     2    car    c  0  0  1
2     3    dog  a,c  1  0  1
3     4    dog  NaN  0  0  0
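
If a substring test really is needed (for example when the column is not delimited), another option is the na argument of str.contains, which sets the result for missing values directly. A sketch, assuming the sample frame from the question; regex=False is an added assumption so that each letter is matched literally:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["apple", "car", "dog", "dog"],
    "col3": ["a,b", "c", "a,c", np.nan],
})

for letter in ["a", "b", "c"]:
    # na=False makes missing values count as "does not contain",
    # so the mask is plain bool and converts cleanly to 0/1.
    df[letter] = df["col3"].str.contains(letter, na=False, regex=False).astype(int)

print(df)
```

This avoids both np.where and the separate fillna call, at the cost of one str.contains pass per letter.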
