replace all values in range of columns based on condition

Question

How can I replace values in multiple columns based on a condition?

Suppose I have a df looking something like this:

df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})

With numpy I can change the value of a column based on a condition like this:

df['A'] = np.where((df['B'] < 5), '-', df['A'])

But how can I change the value of many columns based on a condition? Thought I could do something like below but that's not working.

df[['A','C']] = np.where((df['B'] < 5), '-', df[['A', 'C']])

I could do a loop but that does not feel very pythonic/pands

cols = ['A', 'C']

for col in cols:
    df[col] = np.where((df['B'] < 5), '-', df[col])

jezrael · Accepted Answer · 2020-04-19 12:14:26Z

3

One idea is use DataFrame.mask:

df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
print (df)
   A  C  B
0  -  -  3
1  -  -  4
2  3  3  6
3  4  4  6

Alternative solution with DataFrame.loc:

df.loc[df['B'] < 5, ['A','C']] =  '-'
print (df)
   A  C  B
0  -  -  3
1  -  -  4
2  3  3  6
3  4  4  6

Solution with numpy.where and broadcasting mask:

df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])

Performance if mixed values - numeric with string -:

df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
#400k rows
df = pd.concat([df] * 100000, ignore_index=True)

In [217]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
171 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [219]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])
72.5 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [221]: %timeit df.loc[df['B'] < 5, ['A','C']] =  '-'
27.8 ms ± 533 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Performace if replace by numeric:

df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
df = pd.concat([df] * 100000, ignore_index=True)

In [229]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, 0)
187 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [231]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], 0, df[['A', 'C']])
20.8 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [233]: %timeit df.loc[df['B'] < 5, ['A','C']] =  0
61.3 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Apr 19, 2020 at 12:14

answered Apr 19, 2020 at 12:01

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

honetkri Over a year ago

Would you say .loc is faster than a numpy soltuion?

jezrael Over a year ago

@honetkri - Give me some time for test.

honetkri Over a year ago

whenever, no biggie, I have just seen other numpy solutions outperforming the others.

Collectives™ on Stack Overflow

replace all values in range of columns based on condition

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related