
I am looking for help simplifying my code. The DataFrame has >100k rows and may have multiple columns containing a mix of strings and integers. Here is an example df:

import pandas as pd

data = {
    "Area_1": [0, 100, 200, 0],
    "Area_2": [0, 0, 100, 100],
    "Area_3": [0, 0, 0, 100],
    "id": ["gene_x", "gene_y", "gene_z", "gene_i"],
}
df = pd.DataFrame(data, columns=["id", "Area_1", "Area_2", "Area_3"])

Here is my attempt at simplifying a chunky block of code that worked but could only handle 3 columns. I now want it to accept any number of columns and drop rows where every one of those columns contains the integer 0.

Expected output: everything in the DataFrame but the row containing gene_x.

Current code:

cut = r"^Area"
blade = df.columns.str.contains(cut)
df[(df.loc[:, blade] > 0).any(axis=1)]


Currently, this code executes without error but returns the df unfiltered. My expectation is that it removes any row whose Area columns contain no value > 0.
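For what it's worth, on the four-row sample above the snippet does filter as intended, which suggests the problem lies in the real data (e.g. non-numeric or NaN values in the Area columns) rather than in the logic. A quick sanity check:

```python
import pandas as pd

data = {
    "Area_1": [0, 100, 200, 0],
    "Area_2": [0, 0, 100, 100],
    "Area_3": [0, 0, 0, 100],
    "id": ["gene_x", "gene_y", "gene_z", "gene_i"],
}
df = pd.DataFrame(data, columns=["id", "Area_1", "Area_2", "Area_3"])

cut = r"^Area"
blade = df.columns.str.contains(cut)
print(df.columns[blade].tolist())  # which columns the regex actually matched

# Keep rows where at least one matched column is > 0.
result = df[(df.loc[:, blade] > 0).any(axis=1)]
print(result["id"].tolist())  # gene_x is dropped
```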

  • 1
    Can you please provide sample inputs and expected outputs? Commented Mar 24, 2021 at 19:25
  • For the sample data frame, it would only remove the row where id = gene_x. I have provided a screenshot of my dataframe; there it would be the middle rows, which do not contain an integer value > 0. There are 40 other columns that contain strings or other values which I would like to keep, but I only want to apply the condition to the columns named "Area.....". Commented Mar 24, 2021 at 19:32
  • example data frame has been updated again. Commented Mar 24, 2021 at 19:46

1 Answer


One can try the following.

Create dataframe

import pandas as pd

data = {
    "Area_1": [0, 100, 200, 0],
    "Area_2": [0, 0, 100, 100],
    "Area_3": [0, 0, 0, 100],
    "id": ["gene_x", "gene_y", "gene_z", "gene_i"],
}
df = pd.DataFrame(data, columns=["id", "Area_1", "Area_2", "Area_3"])
df = df.set_index("id")
print(df)

Output:

        Area_1  Area_2  Area_3
id                            
gene_x       0       0       0
gene_y     100       0       0
gene_z     200     100       0
gene_i       0     100     100

Create a boolean mask indicating rows we want

# Subset the columns we are interested in.
df_tmp = df.filter(regex="^Area_", axis="columns")
mask = df_tmp == 0
print(mask.head())

# Collapse across columns
all_cols_zero = mask.all(axis=1)
print(all_cols_zero)

Output:

        Area_1  Area_2  Area_3
id                            
gene_x    True    True    True
gene_y   False    True    True
gene_z   False   False    True
gene_i    True   False   False

id
gene_x     True
gene_y    False
gene_z    False
gene_i    False
dtype: bool

Apply the boolean mask to our original dataframe

# Keep rows where at least one column is non-zero.
# The ~ gets the inverse. So True becomes False.
df.loc[~all_cols_zero, :]

Output:

        Area_1  Area_2  Area_3
id                            
gene_y     100       0       0
gene_z     200     100       0
gene_i       0     100     100
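The three steps above can also be collapsed into a single expression with the same `^Area_` regex, which is convenient for a >100k-row frame with many non-Area columns:

```python
import pandas as pd

data = {
    "Area_1": [0, 100, 200, 0],
    "Area_2": [0, 0, 100, 100],
    "Area_3": [0, 0, 0, 100],
    "id": ["gene_x", "gene_y", "gene_z", "gene_i"],
}
df = pd.DataFrame(data, columns=["id", "Area_1", "Area_2", "Area_3"]).set_index("id")

# Keep rows where at least one Area column is non-zero;
# all other columns pass through untouched.
filtered = df[df.filter(regex="^Area_").ne(0).any(axis=1)]
print(filtered)
```

`ne(0)` is the element-wise `!= 0` comparison, so the mask is True wherever a value differs from zero, and `.any(axis=1)` keeps a row if that holds for any Area column.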

3 Comments

Makes sense, I will give it a go.
This isn't working for my dataframe. I have NaN values for measurements which do not exist, so I am not sure the df == 0 mask will work. I am going to try it after replacing NaN with 0.
Sure, if replacing NaN with 0 works for your data, then that is fine. You can also modify the mask expression to include NaNs, or try the skipna parameter of pandas.DataFrame.all(), which I used to collapse the boolean values to one value per row.
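Regarding the NaN discussion above: instead of replacing NaN with 0 first, the mask itself can treat NaN as zero. A minimal sketch, assuming NaN means "no measurement" (the two-column frame here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Area_1": [0, np.nan, 200],
        "Area_2": [np.nan, 0, 100],
        "id": ["gene_x", "gene_y", "gene_z"],
    }
).set_index("id")

area = df.filter(regex="^Area_")
# Treat a row as "all zero" when every Area value is 0 or NaN.
all_cols_zero = (area.eq(0) | area.isna()).all(axis=1)
print(df.loc[~all_cols_zero])  # only gene_z survives
```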
