2

I have a dataframe containing counts of two things, which I've put in columns numA and numB. I want to find the rows where numA < x and numB < y, which can be done like so:

filtered_df = df[(df.numA < x) & (df.numB < y)]

This works when both numA and numB are present. However, neither column is guaranteed to appear in the dataframe. If only one column exists, I would still like to filter the rows based on it. This could easily be coded with something along the lines of

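filtered_df = df  # start from the full frame so a missing numA doesn't leave filtered_df undefined
if "numA" in df.columns:
    filtered_df = filtered_df[filtered_df.numA < x]
if "numB" in df.columns:
    filtered_df = filtered_df[filtered_df.numB < y]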

But this seems very inefficient, especially since in reality I have 9 columns like this, and each of these requires the same check. Is there a way to achieve the same thing but with code that is more readable, easier to maintain and less tedious to write out?

3
  • You could fill in the missing entries with a default value (one that is lower than your check value, so that the < comparison evaluates to True) using pandas.DataFrame.fillna. Use a value that wouldn't otherwise occur and/or work on a copy of the data so that the fact there are missing entries doesn't get permanently wiped. Commented Feb 4, 2021 at 18:17
  • Theoretically you could chain operators together with isnull(). Not at my desk at the moment to test in full, hence a comment not an answer. e.g. (df.numA < x) | (df.numA.isnull()) Commented Feb 4, 2021 at 18:20
  • @Alan there are no null values, the entire columns are potentially absent in the dataframe. For some context: I begin with a large dataframe containing more columns, and then delete some of the columns, potentially removing numA and numB along the way depending on certain conditions. So if numA isn't present then df.numA.isnull() returns an error: 'DataFrame' object has no attribute 'numA'. Commented Feb 4, 2021 at 19:24

2 Answers

2

If you want an all-or-nothing type comparison I think a fairly easy way is to use set comparisons:

if set(list_of_cols_to_check).issubset(df.columns):
    filtered_df = df[(df.numA < x) & ... & (df.numB < y)]
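As a minimal runnable sketch of the all-or-nothing check, assuming the thresholds live in a dict (the `limits` name, its values and the sample frame are hypothetical, and leaving the frame unfiltered when a column is missing is also an assumption):

```python
import pandas as pd

# Hypothetical sample data and thresholds, for illustration only
df = pd.DataFrame({"numA": [1, 5, 2], "numB": [0, 1, 3]})
limits = {"numA": 4, "numB": 2}

if set(limits).issubset(df.columns):
    cols = list(limits)
    # Compare each column with its own limit, then keep rows where all comparisons pass
    filtered_df = df[df[cols].lt(pd.Series(limits)).all(axis=1)]
else:
    # Assumption: if any column is missing, leave the frame unfiltered
    filtered_df = df
```

Here only the first row satisfies both numA < 4 and numB < 2.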

If you want to perform comparisons for all that do exist it gets a bit more complicated. It is not very different than what you have, but I'd probably do it as follows:

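import pandas as pd

# Start from an all-True mask; df.index >= 0 is only always true for a non-negative numeric index
mask = pd.Series(True, index=df.index)
# Each line narrows the mask only if its column exists (mask avoids shadowing the built-in filter)
mask = mask & (df.numA < 4) if 'numA' in df else mask
mask = mask & (df.numB < 2) if 'numB' in df else mask
mask = mask & (df.numC < 1) if 'numC' in df else mask
df[mask]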
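With 9 columns, the repeated lines could be folded into a loop. The following is only a sketch under assumptions: the `filter_below` helper, the `thresholds` dict and the sample frame are hypothetical names and values, not part of the question.

```python
import pandas as pd

def filter_below(df, thresholds):
    """Keep rows where every column listed in thresholds AND present in df is below its limit."""
    keep = pd.Series(True, index=df.index)  # all-True starting mask
    for col, limit in thresholds.items():
        if col in df.columns:  # columns absent from df are skipped
            keep &= df[col] < limit
    return df[keep]

# Hypothetical sample: numB is absent, so only numA and numC are checked
df = pd.DataFrame({"numA": [1, 5, 2], "numC": [0, 0, 3]})
thresholds = {"numA": 4, "numB": 2, "numC": 1}
filtered_df = filter_below(df, thresholds)
```

Adding another column then only means adding one entry to the dict.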

0

You can use a simpler solution if you're not sure whether the columns will be there when you filter for them:

df.loc[:, df.columns.isin(["numA", "numB", "numC"])]
  • isin() returns an array of booleans, one per column, that is True where the column name is in the list
  • .loc[] takes row and column selectors as arguments
  • the : means all rows
  • the second argument is the boolean array marking which columns to keep
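Note that this expression selects columns and does not filter rows by itself. A small sketch of how it might be combined with a row filter, assuming per-column thresholds held in a Series (the `limits` name, its values and the sample frame are hypothetical):

```python
import pandas as pd

# Hypothetical sample: only numA is present among the columns of interest
df = pd.DataFrame({"numA": [1, 5], "other": ["a", "b"]})
limits = pd.Series({"numA": 4, "numB": 2, "numC": 1})

present = df.columns.isin(limits.index)  # boolean array, one entry per column of df
subset = df.loc[:, present]              # only the threshold columns that exist

# Compare each existing column with its own limit, then require every comparison to pass
rows = subset.lt(limits[subset.columns], axis="columns").all(axis=1)
filtered_df = df[rows]
```

If none of the columns exist, `subset` has no columns, `all(axis=1)` is True for every row, and the frame comes back unfiltered.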
