
I am looking for help simplifying my code. The DataFrame has >100k rows and may have multiple columns containing a mix of strings and integers. Here is an example df:

import pandas as pd

data = {
    "Area_1": [0, 100, 200, 0],
    "Area_2": [0, 0, 100, 100],
    "Area_3": [0, 0, 0, 100],
    "id": ["gene_x", "gene_y", "gene_z", "gene_i"],
}
df = pd.DataFrame(data, columns=["id", "Area_1", "Area_2", "Area_3"])

Here is my attempt at simplifying a chunky block of code that worked but could only handle 3 columns. I now want it to accept any number of columns and drop rows where every one of those columns contains the integer 0.

Expected output: everything in the DataFrame but the row containing gene_x.

Current code:

cut = r"^Area"
blade = df.columns.str.contains(cut)
df[(df.loc[:, blade] > 0).any(axis=1)]


Currently, this code executes without error but returns the df unfiltered. My expectation is that it removes any row whose Area columns contain no value > 0.
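For what it's worth, on the four-row sample above the snippet does filter as intended, which suggests the problem lies in the real data (e.g. non-numeric or NaN values in the Area columns) rather than in the logic. A quick sanity check:

```python
import pandas as pd

data = {
    "Area_1": [0, 100, 200, 0],
    "Area_2": [0, 0, 100, 100],
    "Area_3": [0, 0, 0, 100],
    "id": ["gene_x", "gene_y", "gene_z", "gene_i"],
}
df = pd.DataFrame(data, columns=["id", "Area_1", "Area_2", "Area_3"])

cut = r"^Area"
blade = df.columns.str.contains(cut)
print(df.columns[blade].tolist())  # which columns the regex actually matched

# Keep rows where at least one matched column is > 0.
result = df[(df.loc[:, blade] > 0).any(axis=1)]
print(result["id"].tolist())  # gene_x is dropped
```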

  • 1
    Can you please provide sample inputs and expected outputs? Commented Mar 24, 2021 at 19:25
  • For the sample data frame, it would only remove the row where id = gene_x. I have provided a screenshot of my dataframe; there it would be the middle rows, which do not contain an integer value > 0. There are 40 other columns that contain strings or other values which I would like to keep, but I only want to apply the condition to the columns named "Area.....". Commented Mar 24, 2021 at 19:32
  • example data frame has been updated again. Commented Mar 24, 2021 at 19:46

1 Answer


One can try the following.

Create dataframe

import pandas as pd

data = {
    "Area_1": [0, 100, 200, 0],
    "Area_2": [0, 0, 100, 100],
    "Area_3": [0, 0, 0, 100],
    "id": ["gene_x", "gene_y", "gene_z", "gene_i"],
}
df = pd.DataFrame(data, columns=["id", "Area_1", "Area_2", "Area_3"])
df = df.set_index("id")
print(df)

Output:

        Area_1  Area_2  Area_3
id                            
gene_x       0       0       0
gene_y     100       0       0
gene_z     200     100       0
gene_i       0     100     100

Create a boolean mask indicating rows we want

# Subset the columns we are interested in.
df_tmp = df.filter(regex="^Area_", axis="columns")
mask = df_tmp == 0
print(mask.head())

# Collapse across columns
all_cols_zero = mask.all(axis=1)
print(all_cols_zero)

Output:

        Area_1  Area_2  Area_3
id                            
gene_x    True    True    True
gene_y   False    True    True
gene_z   False   False    True
gene_i    True   False   False

id
gene_x     True
gene_y    False
gene_z    False
gene_i    False
dtype: bool

Apply the boolean mask to our original dataframe

# Keep rows where at least one column is non-zero.
# The ~ gets the inverse. So True becomes False.
df.loc[~all_cols_zero, :]

Output:

        Area_1  Area_2  Area_3
id                            
gene_y     100       0       0
gene_z     200     100       0
gene_i       0     100     100
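The three steps above can also be collapsed into a single expression with the same `^Area_` regex, which is convenient for a >100k-row frame with many non-Area columns:

```python
import pandas as pd

data = {
    "Area_1": [0, 100, 200, 0],
    "Area_2": [0, 0, 100, 100],
    "Area_3": [0, 0, 0, 100],
    "id": ["gene_x", "gene_y", "gene_z", "gene_i"],
}
df = pd.DataFrame(data, columns=["id", "Area_1", "Area_2", "Area_3"]).set_index("id")

# Keep rows where at least one Area column is non-zero;
# all other columns pass through untouched.
filtered = df[df.filter(regex="^Area_").ne(0).any(axis=1)]
print(filtered)
```

`ne(0)` is the element-wise `!= 0` comparison, so the mask is True wherever a value differs from zero, and `.any(axis=1)` keeps a row if that holds for any Area column.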

3 Comments

Makes sense, I will give it a go.
This isn't working for my dataframe. I have NaN values for measurements which do not exist, so I am not sure the df == 0 mask will work. I am going to try it after replacing NaN with 0.
Sure, if replacing NaN with 0 works for your data, then that is fine. You can also modify the mask expression to include NaNs, or try the skipna parameter of pandas.DataFrame.all(), which I used to collapse the boolean values to one value per row.
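Regarding the NaN discussion above: instead of replacing NaN with 0 first, the mask itself can treat NaN as zero. A minimal sketch, assuming NaN means "no measurement" (the two-column frame here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Area_1": [0, np.nan, 200],
        "Area_2": [np.nan, 0, 100],
        "id": ["gene_x", "gene_y", "gene_z"],
    }
).set_index("id")

area = df.filter(regex="^Area_")
# Treat a row as "all zero" when every Area value is 0 or NaN.
all_cols_zero = (area.eq(0) | area.isna()).all(axis=1)
print(df.loc[~all_cols_zero])  # only gene_z survives
```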
