
I have a dataframe as such:

     col0   col1  col2  col3
ID1    0      2     0     2
ID2    1      1     2     10
ID3    0      1     3     4

I want to remove rows that contain more than one zero.

I've tried to do:

cols = ['col1', etc]
df.loc[:, cols].value_counts()

But this only works for Series, not DataFrames.

df.loc[:, cols].count(0) <= 1

Only returns bools.

I feel like I'm close with the 2nd attempt here.
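For reference, the example frame from the table above can be reconstructed like this (a minimal sketch; the index labels and dtypes are assumed from the printed output):

```python
import pandas as pd

# Reconstruction of the example frame shown in the question.
df = pd.DataFrame(
    {"col0": [0, 1, 0], "col1": [2, 1, 1], "col2": [0, 2, 3], "col3": [2, 10, 4]},
    index=["ID1", "ID2", "ID3"],
)
```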

3 Answers


Apply the condition and count the True values.

(df == 0).sum(1)

ID1    2
ID2    0
ID3    1
dtype: int64

df[(df == 0).sum(1) < 2]

     col0  col1  col2  col3
ID2     1     1     2    10
ID3     0     1     3     4

Alternatively, convert the integers to bool and sum that. A little more direct.

# df[(~df.astype(bool)).sum(1) < 2]
df[df.astype(bool).sum(1) > len(df.columns)-2]  # no inversion needed

     col0  col1  col2  col3
ID2     1     1     2    10
ID3     0     1     3     4

For performance, you can use np.count_nonzero:

# df[np.count_nonzero(df, axis=1) > len(df.columns)-2]
df[np.count_nonzero(df.values, axis=1) > len(df.columns)-2]

     col0  col1  col2  col3
ID2     1     1     2    10
ID3     0     1     3     4

df = pd.concat([df] * 10000, ignore_index=True)

%timeit df[(df == 0).sum(1) < 2]
%timeit df[df.astype(bool).sum(1) > len(df.columns)-2]
%timeit df[np.count_nonzero(df.values, axis=1) > len(df.columns)-2]

7.13 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.28 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
997 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

4 Comments

Nice one. And if I want to further restrict the column checking, I can do df[(df[my_cols] == 0).sum(1) < 2], right? Assuming that I have a much larger dataset to begin with
@PeptideWitch Yes, that should suffice.
Interesting #3 method there - I quite like that one. Works well for my dataset too
What does len(df.columns) - 2 mean? @cs95
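The column-restricted variant suggested in the comments can be sketched as follows (my_cols is a hypothetical subset; zeros outside those columns are ignored, so ID1's zero in col0 no longer counts against it):

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {"col0": [0, 1, 0], "col1": [2, 1, 1], "col2": [0, 2, 3], "col3": [2, 10, 4]},
    index=["ID1", "ID2", "ID3"],
)

my_cols = ["col1", "col2", "col3"]  # hypothetical subset of columns to check

# Count zeros only within my_cols and keep rows with fewer than two.
out = df[(df[my_cols] == 0).sum(axis=1) < 2]
```

With this subset, ID1 has only one zero among the checked columns, so all three rows survive.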

Using

df.loc[df.eq(0).sum(1).le(1), :]
     col0  col1  col2  col3
ID2     1     1     2    10
ID3     0     1     3     4

A fun way

df.mask(df.eq(0)).dropna(axis=0, thresh=df.shape[1] - 1).fillna(0)
     col0  col1  col2  col3
ID2   1.0     1   2.0    10
ID3   0.0     1   3.0     4    

2 Comments

Are you using loc to avoid the SettingWithCopyWarning?
Method #1 works but only returns the IDs of the dataset, so you'd have to wrap this inside a condition to filter out the dataframe to begin with. Still, neat answer. Ty :)
# assumes: import numpy as np
df.replace(0, np.nan, inplace=True)
# thresh is the minimum number of non-NaN values a row must keep,
# so allowing at most one zero per row means thresh=df.shape[1] - 1:
df.dropna(thresh=df.shape[1] - 1, inplace=True)
df.fillna(0, inplace=True)
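A non-inplace sketch of the same replace/dropna/fillna idea; note that dropna's thresh is the minimum number of non-NaN values a row must keep, so the "at most one zero" rule needs thresh=df.shape[1] - 1 rather than a fixed 2:

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {"col0": [0, 1, 0], "col1": [2, 1, 1], "col2": [0, 2, 3], "col3": [2, 10, 4]},
    index=["ID1", "ID2", "ID3"],
)

# Mark zeros as NaN, keep rows with at least ncols-1 real values,
# then restore the surviving zeros.
out = (
    df.replace(0, np.nan)
      .dropna(thresh=df.shape[1] - 1)
      .fillna(0)
)
```

ID1 carries two zeros and is dropped; columns that ever held a zero come back as floats after the NaN round-trip.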

