
I have a dataframe as such:

     col0   col1  col2  col3
ID1    0      2     0     2
ID2    1      1     2     10
ID3    0      1     3     4

I want to remove rows that contain more than one zero.

I've tried to do:

cols = ['col1', etc]
df.loc[:, cols].value_counts()

But this only works for Series, not DataFrames.

df.loc[:, cols].count(0) <= 1

Only returns bools.

I feel like I'm close with the 2nd attempt here.
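For reference, the example frame from the table above can be reconstructed like this (a minimal sketch; the index labels and dtypes are assumed from the printed output):

```python
import pandas as pd

# Reconstruction of the example frame shown in the question.
df = pd.DataFrame(
    {"col0": [0, 1, 0], "col1": [2, 1, 1], "col2": [0, 2, 3], "col3": [2, 10, 4]},
    index=["ID1", "ID2", "ID3"],
)
```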

3 Answers


Apply the condition and count the True values.

(df == 0).sum(1)

ID1    2
ID2    0
ID3    1
dtype: int64

df[(df == 0).sum(1) < 2]

     col0  col1  col2  col3
ID2     1     1     2    10
ID3     0     1     3     4

Alternatively, convert the integers to bool and sum that. A little more direct.

# df[(~df.astype(bool)).sum(1) < 2]
df[df.astype(bool).sum(1) > len(df.columns)-2]  # no inversion needed

     col0  col1  col2  col3
ID2     1     1     2    10
ID3     0     1     3     4

For performance, you can use np.count_nonzero:

# df[np.count_nonzero(df, axis=1) > len(df.columns)-2]
df[np.count_nonzero(df.values, axis=1) > len(df.columns)-2]

     col0  col1  col2  col3
ID2     1     1     2    10
ID3     0     1     3     4

df = pd.concat([df] * 10000, ignore_index=True)

%timeit df[(df == 0).sum(1) < 2]
%timeit df[df.astype(bool).sum(1) > len(df.columns)-2]
%timeit df[np.count_nonzero(df.values, axis=1) > len(df.columns)-2]

7.13 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.28 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
997 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

4 Comments

Nice one. And if I want to further restrict the column checking, I can do df[(df[my_cols] == 0).sum(1) < 2], right? Assuming that I have a much larger dataset to begin with
@PeptideWitch Yes, that should suffice.
Interesting #3 method there - I quite like that one. Works well for my dataset too
What does len(df.columns) - 2 mean? @cs95
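The column-restricted variant suggested in the comments can be sketched as follows (my_cols is a hypothetical subset; zeros outside those columns are ignored, so ID1's zero in col0 no longer counts against it):

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {"col0": [0, 1, 0], "col1": [2, 1, 1], "col2": [0, 2, 3], "col3": [2, 10, 4]},
    index=["ID1", "ID2", "ID3"],
)

my_cols = ["col1", "col2", "col3"]  # hypothetical subset of columns to check

# Count zeros only within my_cols and keep rows with fewer than two.
out = df[(df[my_cols] == 0).sum(axis=1) < 2]
```

With this subset, ID1 has only one zero among the checked columns, so all three rows survive.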

Using

df.loc[df.eq(0).sum(1).le(1), :]
     col0  col1  col2  col3
ID2     1     1     2    10
ID3     0     1     3     4

A fun way

df.mask(df.eq(0)).dropna(axis=0, thresh=df.shape[1] - 1).fillna(0)
     col0  col1  col2  col3
ID2   1.0     1   2.0    10
ID3   0.0     1   3.0     4    

2 Comments

Are you using loc to avoid the SettingWithCopyWarning?
Method #1 works but only returns the IDs of the dataset, so you'd have to wrap this inside a condition to filter out the dataframe to begin with. Still, neat answer. Ty :)
# assumes: import numpy as np
df.replace(0, np.nan, inplace=True)
# thresh is the minimum number of non-NaN values a row must keep,
# so allowing at most one zero per row means thresh=df.shape[1] - 1:
df.dropna(thresh=df.shape[1] - 1, inplace=True)
df.fillna(0, inplace=True)
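A non-inplace sketch of the same replace/dropna/fillna idea; note that dropna's thresh is the minimum number of non-NaN values a row must keep, so the "at most one zero" rule needs thresh=df.shape[1] - 1 rather than a fixed 2:

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {"col0": [0, 1, 0], "col1": [2, 1, 1], "col2": [0, 2, 3], "col3": [2, 10, 4]},
    index=["ID1", "ID2", "ID3"],
)

# Mark zeros as NaN, keep rows with at least ncols-1 real values,
# then restore the surviving zeros.
out = (
    df.replace(0, np.nan)
      .dropna(thresh=df.shape[1] - 1)
      .fillna(0)
)
```

ID1 carries two zeros and is dropped; columns that ever held a zero come back as floats after the NaN round-trip.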

