1

How can I replace values in multiple columns based on a condition?

Suppose I have a df looking something like this:

df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})

With numpy I can change the value of a column based on a condition like this:

df['A'] = np.where((df['B'] < 5), '-', df['A'])

But how can I change the value of many columns based on a condition? Thought I could do something like below but that's not working.

df[['A','C']] = np.where((df['B'] < 5), '-', df[['A', 'C']])

I could do a loop but that does not feel very pythonic/pands

cols = ['A', 'C']

for col in cols:
    df[col] = np.where((df['B'] < 5), '-', df[col])

1 Answer 1

3

One idea is use DataFrame.mask:

df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
print (df)
   A  C  B
0  -  -  3
1  -  -  4
2  3  3  6
3  4  4  6

Alternative solution with DataFrame.loc:

df.loc[df['B'] < 5, ['A','C']] =  '-'
print (df)
   A  C  B
0  -  -  3
1  -  -  4
2  3  3  6
3  4  4  6

Solution with numpy.where and broadcasting mask:

df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])

Performance if mixed values - numeric with string -:

df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
#400k rows
df = pd.concat([df] * 100000, ignore_index=True)

In [217]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
171 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [219]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])
72.5 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [221]: %timeit df.loc[df['B'] < 5, ['A','C']] =  '-'
27.8 ms ± 533 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Performace if replace by numeric:

df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
df = pd.concat([df] * 100000, ignore_index=True)

In [229]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, 0)
187 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [231]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], 0, df[['A', 'C']])
20.8 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [233]: %timeit df.loc[df['B'] < 5, ['A','C']] =  0
61.3 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sign up to request clarification or add additional context in comments.

3 Comments

Would you say .loc is faster than a numpy soltuion?
@honetkri - Give me some time for test.
whenever, no biggie, I have just seen other numpy solutions outperforming the others.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.