3

Given a dataframe like:

A    B    C
1    a    yes
2    b    yes
3    a    no

I would like to change the dataframe to:

A    B    C
1    a    yes
2    b    no
3    a    no

which means that if column B has the value 'b', I want to change the column C to 'no'. Which can be represented by df[df['B']=='b']['C'].str.replace('yes','no'). But use this will not change dataframe df itself. Even I tried df[df['B']=='b']['C'] = df[df['B']=='b']['C'].str.replace('yes','no') it didn't work. I am wondering how to solve this problem.

2 Answers 2

5

Solutions with set values by mask:

df.loc[df.B == 'b', 'C'] = 'no'
print (df)
   A  B    C
0  1  a  yes
1  2  b   no
2  3  a   no

df['C'] = df['C'].mask(df.B == 'b','no')
print (df)
   A  B    C
0  1  a  yes
1  2  b   no
2  3  a   no

Solutions with replace only yes string:

df.loc[df.B == 'b', 'C'] = df['C'].replace('yes', 'no')
print (df)
   A  B    C
0  1  a  yes
1  2  b   no
2  3  a   no

df['C'] = df['C'].mask(df.B == 'b', df['C'].replace('yes', 'no'))
print (df)
   A  B    C
0  1  a  yes
1  2  b   no
2  3  a   no

Difference better seen in changed df:

print (df)
   A  B        C
0  1  a      yes
1  2  b      yes
2  3  b  another
3  4  a       no

df['C_set'] = df['C'].mask(df.B == 'b','no')
df['C_replace'] = df['C'].mask(df.B == 'b', df['C'].replace('yes', 'no'))

print (df)
   A  B        C C_set C_replace
0  1  a      yes   yes       yes
1  2  b      yes    no        no
2  3  b  another    no   another
3  4  a       no    no        no

EDIT:

In your solution is necessary only add loc:

df.loc[df['B']=='b', 'C'] = df.loc[df['B']=='b', 'C'].str.replace('yes','no')
print (df)
   A  B        C
0  1  a      yes
1  2  b       no
2  3  b  another
3  4  a       no

EDIT1:

I was really curious what method is fastest:

#[40000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)    
print (df)

In [37]: %timeit df.loc[df['B']=='b', 'C'] = df['C'].str.replace('yes','no')
10 loops, best of 3: 79.5 ms per loop

In [38]: %timeit df.loc[df['B']=='b', 'C'] = df.loc[df['B']=='b','C'].str.replace('yes','no')
10 loops, best of 3: 48.4 ms per loop

In [39]: %timeit df.loc[df['B']=='b', 'C'] = df.loc[df['B']=='b', 'C'].replace('yes','no')
100 loops, best of 3: 14.1 ms per loop

In [40]: %timeit df['C'] = df['C'].mask(df.B == 'b', df['C'].replace('yes', 'no'))
100 loops, best of 3: 10.1 ms per loop

# piRSquared solution with replace
In [53]: %timeit df.C = np.where(df.B.values == 'b', df.C.replace('yes', 'no'), df.C.values)
100 loops, best of 3: 4.74 ms per loop

EDIT1:

Better is change condition - add df.C == 'yes' or df.C.values == 'yes' if need fastest solution:

df.loc[(df.B == 'b') & (df.C == 'yes'), 'C'] = 'no'

df.C = np.where((df.B.values == 'b') & (df.C.values == 'yes'), 'no', df.C.values)
Sign up to request clarification or add additional context in comments.

8 Comments

Thanks a lot. In fact my case is a little difficult than what I have showed. I will try the methods you introduced first.
there shouldn't be a need to use df.C.replace. When df.B.values=='b', we'll be making df.C no... the replace uses excess cpu when logically it isn't needed.
In fact in the original dataframe C is a column of strings with each value contains specific words like 'yes' or 'no'. Such as "yes, it is"
@natsuapo - I am thinking about better solution - check last edit.
@mkheifetz - hmm, so need df['C'] = df['C'].str.replace(r"\ba\b",' '), it is called word boundary
|
4

np.where

df.C = np.where(df.B == 'b', 'no', df.C)

Or

df.C = np.where(df.B.values == 'b', 'no', df.C.values)

pd.Series.mask

df.C = df.C.mask(df.B == 'b', 'no')

All change df in place and yield

   A  B    C
0  1  a  yes
1  2  b   no
2  3  a   no

timing enter image description here

1 Comment

Thanks a lot. I will try where as well.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.