
I am new to Python (used to coding with its cousin R) and am still getting the hang of pandas. There is an incredibly helpful related post, but instead of filter()ing by a set number, I was hoping to filter by a criterion defined in a second data set.

Let's make some toy data:

import pandas as pd

pets = [['foxhound', 'dog', 20], ['husky', 'dog', 25], ['GSD', 'dog', 24],['Labrador', 'dog', 23],['Persian', 'cat', 7],['Siamese', 'cat', 6],['Tabby', 'cat', 5]]

df = pd.DataFrame(pets , columns = ['breed', 'species','height']).set_index('breed')

TooBigForManhattan = [['dog', 22],['cat', 6]]

TooBig = pd.DataFrame(TooBigForManhattan, columns = ['species','height']).set_index('species')

I am trying to subset df by selecting the breeds whose height is less than or equal to the TooBig cutoff for their species. My pseudo-code looks like:

df.groupby(['breed','species']).filter(lambda x : (x['height']<='TooBig Cutoff by Species').any())

The data I am working with are thousands of entries with about a hundred criteria, so any help in defining a solution that could work at that scale would be very helpful.

Thanks in advance!

2 Answers


Because the lookup is on a single column, you don't need a full join: you can map each species to its cutoff height and keep the rows whose height is less than or equal to that cutoff.

df[df['height'] <= df['species'].map(dict(TooBigForManhattan))]

         species  height
breed                   
foxhound     dog      20
Siamese      cat       6
Tabby        cat       5

Here's a bit more detail about some of the intermediate steps.

# List of lists becomes this dict
dict(TooBigForManhattan)
#{'dog': 22, 'cat': 6}

# We use this Boolean Series to slice the DataFrame
df.height <= df.species.map(dict(TooBigForManhattan))
#breed
#foxhound     True
#husky       False
#GSD         False
#Labrador    False
#Persian     False
#Siamese      True
#Tabby        True
#dtype: bool
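
The same idea can be sketched against the TooBig DataFrame directly, since Series.map also accepts a Series and looks values up by its index. This is a minimal sketch rather than part of the steps above: the 'Lop' rabbit row and the names cutoff, no_cutoff, and small_enough are made up for illustration. It shows that a species with no cutoff maps to NaN, so the comparison is False and the row is dropped.

```python
import pandas as pd

pets = [['foxhound', 'dog', 20], ['husky', 'dog', 25], ['GSD', 'dog', 24],
        ['Labrador', 'dog', 23], ['Persian', 'cat', 7], ['Siamese', 'cat', 6],
        ['Tabby', 'cat', 5],
        ['Lop', 'rabbit', 4]]  # hypothetical row: a species with no cutoff
df = pd.DataFrame(pets, columns=['breed', 'species', 'height']).set_index('breed')

TooBig = pd.DataFrame([['dog', 22], ['cat', 6]],
                      columns=['species', 'height']).set_index('species')

# Look up each row's cutoff by species; unmatched species become NaN
cutoff = df['species'].map(TooBig['height'])

# Rows whose species has no cutoff at all (worth inspecting on real data)
no_cutoff = df[cutoff.isna()]

# Comparing against NaN is False, so rows without a cutoff are excluded here
small_enough = df[df['height'] <= cutoff]
print(small_enough)
```

If rows without a cutoff should be kept instead, one option is to combine the mask with cutoff.isna() using |.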

3 Comments

Nice one. Just in case the mapper is a one-column DataFrame: df[df['height'] <= df['species'].map(TooBig.squeeze())]
Big thanks for showing the steps. I am getting a KeyError on my actual data using df[df['height'] <= df['species'].map(dict(TooBigForManhattan))], but running through the intermediate steps, I get a clean Boolean output. How can I go from the big Boolean Series of True and False values to the final output?
@EBITDAN Hmm, print df.columns and make sure height and species are both in there. Did you perhaps try the other solution first, so that the merge changed your columns?
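
The column check suggested in the last comment could look something like the sketch below. It only guesses at one possible cause of the KeyError (a needed column sitting in the index instead of in df.columns). The two-row frame and the names required, missing, and in_index are hypothetical and not taken from the asker's data.

```python
import pandas as pd

# Hypothetical frame where 'species' ended up in the index,
# so df['species'] would raise a KeyError
df = pd.DataFrame({'breed': ['foxhound', 'Tabby'],
                   'species': ['dog', 'cat'],
                   'height': [20, 5]}).set_index(['breed', 'species'])

required = {'species', 'height'}
missing = required - set(df.columns)       # columns the filter needs but cannot find
in_index = missing & set(df.index.names)   # missing columns that are index levels

print('missing from columns:', missing)
if in_index:
    # Move those index levels back into ordinary columns
    df = df.reset_index(level=list(in_index))
```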

I believe you need merge, after which you can filter with df.query:

out = (df.merge(TooBig,left_on='species',right_index=True,suffixes=('','_y'))
         .query("height<=height_y").loc[:,df.columns])
print(out)

Or similarly:

out = df.merge(TooBig,left_on='species',right_index=True,suffixes=('','_y'))
out = out[out['height']<=out['height_y']].reindex(columns=df.columns)
print(out)

         species  height
breed                   
foxhound     dog      20
Siamese      cat       6
Tabby        cat       5
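
A variant of the same idea, as a sketch under assumptions: DataFrame.join with on='species' matches df's species column against TooBig's index and keeps df's breed index. Renaming the cutoff column first (max_height and limits are names made up here) avoids the need for suffixes.

```python
import pandas as pd

pets = [['foxhound', 'dog', 20], ['husky', 'dog', 25], ['GSD', 'dog', 24],
        ['Labrador', 'dog', 23], ['Persian', 'cat', 7], ['Siamese', 'cat', 6],
        ['Tabby', 'cat', 5]]
df = pd.DataFrame(pets, columns=['breed', 'species', 'height']).set_index('breed')

TooBig = pd.DataFrame([['dog', 22], ['cat', 6]],
                      columns=['species', 'height']).set_index('species')

# Give the cutoff column its own name so it cannot clash with df['height']
limits = TooBig.rename(columns={'height': 'max_height'})

# Left join: df's 'species' column against the index of limits
joined = df.join(limits, on='species')

# Keep rows at or under their species cutoff, then drop the helper column
out = joined[joined['height'] <= joined['max_height']].drop(columns='max_height')
print(out)
```

Because join defaults to a left join, a species missing from TooBig would get NaN in max_height and be filtered out by the comparison.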
