Python/Pandas: subset a Dataframe by matched index criteria

Question

I am new to Python (used to coding with its cousin R) and am still getting the hang of pandas. There is an incredibly helpful, related post, but instead of filter()ing by a set number, I was hoping to do so by criteria defined in a second data set.

Let's make some toy data:

import pandas as pd

pets = [['foxhound', 'dog', 20], ['husky', 'dog', 25], ['GSD', 'dog', 24],['Labrador', 'dog', 23],['Persian', 'cat', 7],['Siamese', 'cat', 6],['Tabby', 'cat', 5]]

df = pd.DataFrame(pets , columns = ['breed', 'species','height']).set_index('breed')

TooBigForManhattan = [['dog', 22],['cat', 6]]

TooBig = pd.DataFrame(TooBigForManhattan, columns = ['species','height']).set_index('species')

I am trying to subset df by selecting the breeds whose height is less than or equal to the TooBig value for their species. My pseudo-code looks like:

df.groupby(['breed','species']).filter(lambda x : (x['height']<='TooBig Cutoff by Species').any())

The data I am working with are thousands of entries with about a hundred criteria, so any help in defining a solution that could work at that scale would be very helpful.

Thanks in advance!

ALollz · Accepted Answer · 2020-03-11 15:11:26Z

3

Because the lookup is on a single column, you can map each species to its cutoff height and check whether the height in the DataFrame is less than or equal to it.

df[df['height'] <= df['species'].map(dict(TooBigForManhattan))]

         species  height
breed                   
foxhound     dog      20
Siamese      cat       6
Tabby        cat       5

Here's a bit more detail about some of the intermediate steps.

# List of lists becomes this dict
dict(TooBigForManhattan)
#{'dog': 22, 'cat': 6}

# We use this Boolean Series to slice the DataFrame
df.height <= df.species.map(dict(TooBigForManhattan))
#breed
#foxhound     True
#husky       False
#GSD         False
#Labrador    False
#Persian     False
#Siamese      True
#Tabby        True
#dtype: bool
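
For the scale mentioned in the question (thousands of rows, about a hundred cutoffs), the same mapping idea might look like the sketch below. It assumes the cutoffs live in a one-column DataFrame like TooBig. The 'Dutch' rabbit row and the names cutoffs, limit, small_enough and no_cutoff are made up for illustration. A species with no cutoff maps to NaN, and a comparison against NaN is False, so those rows are dropped from the result.

```python
import pandas as pd

pets = [['foxhound', 'dog', 20], ['husky', 'dog', 25], ['GSD', 'dog', 24],
        ['Labrador', 'dog', 23], ['Persian', 'cat', 7], ['Siamese', 'cat', 6],
        ['Tabby', 'cat', 5],
        ['Dutch', 'rabbit', 4]]  # hypothetical row: a species with no cutoff
df = pd.DataFrame(pets, columns=['breed', 'species', 'height']).set_index('breed')

TooBig = pd.DataFrame([['dog', 22], ['cat', 6]],
                      columns=['species', 'height']).set_index('species')

# Series indexed by species; map() accepts it just like the dict above
cutoffs = TooBig['height']

# Cutoff for each row's species (NaN when no cutoff is defined)
limit = df['species'].map(cutoffs)

# Comparisons against NaN are False, so rows without a cutoff are excluded
small_enough = df[df['height'] <= limit]

# Rows whose species has no cutoff at all, kept aside for inspection
no_cutoff = df[limit.isna()]
```

Whether rows without a cutoff should be dropped or kept is a choice that depends on the data; inspecting no_cutoff first makes that decision explicit.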


3 Comments

anky Over a year ago

Nice one. Just in case the mapper is a one-column df: df[df['height'] <= df['species'].map(TooBig.squeeze())]

EBITDAN Over a year ago

Big thanks for showing the steps. I am getting a KeyError on my actual data using df[df['height'] <= df['species'].map(dict(TooBigForManhattan))], but running through the intermediate steps, I get a clean Boolean output. How can I go from the big df() with True and False to the final output?

ALollz Over a year ago

@EBITDAN hmm, print df.columns and make sure height and species are in there. Did you perhaps try the other solution and merge, so that your columns got messed up?
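
The cause of the KeyError on the commenter's real data is not known. One possibility, sketched below, is that a needed column ended up in the index rather than in df.columns. The frame, the required list and the names missing, in_index and fixed are hypothetical and only illustrate the check suggested above.

```python
import pandas as pd

# Hypothetical frame where 'species' is an index level, not a column
df = pd.DataFrame({'breed': ['foxhound', 'Tabby'],
                   'species': ['dog', 'cat'],
                   'height': [20, 5]}).set_index(['breed', 'species'])

required = ['species', 'height']

# Columns the filter needs but the frame does not have
missing = [c for c in required if c not in df.columns]

# If a missing name is an index level, move it back into the columns
in_index = [c for c in missing if c in df.index.names]
fixed = df.reset_index(level=in_index) if in_index else df

cutoffs = {'dog': 22, 'cat': 6}
out = fixed[fixed['height'] <= fixed['species'].map(cutoffs)]
```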

anky · Accepted Answer · 2020-03-11 15:07:11Z

3

I believe you need merge, after which you can use df.query:

out = (df.merge(TooBig,left_on='species',right_index=True,suffixes=('','_y'))
         .query("height<=height_y").loc[:,df.columns])
print(out)

Or similarly:

out = df.merge(TooBig,left_on='species',right_index=True,suffixes=('','_y'))
out = out[out['height']<=out['height_y']].reindex(columns=df.columns)
print(out)

         species  height
breed                   
foxhound     dog      20
Siamese      cat       6
Tabby        cat       5
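
With many cutoffs, a duplicated species in the cutoff table would silently duplicate rows in the merge. A minimal sketch of one way to guard against that, assuming each species should have exactly one cutoff, is to pass validate='many_to_one' to merge, which raises an error if the right-hand keys are not unique. The '_max' suffix and the name merged are illustrative choices, not part of the answer above.

```python
import pandas as pd

pets = [['foxhound', 'dog', 20], ['husky', 'dog', 25], ['GSD', 'dog', 24],
        ['Labrador', 'dog', 23], ['Persian', 'cat', 7], ['Siamese', 'cat', 6],
        ['Tabby', 'cat', 5]]
df = pd.DataFrame(pets, columns=['breed', 'species', 'height']).set_index('breed')

TooBig = pd.DataFrame([['dog', 22], ['cat', 6]],
                      columns=['species', 'height']).set_index('species')

# Inner merge on species; validate checks that each species appears
# only once in TooBig, so no row of df can be duplicated
merged = df.merge(TooBig, left_on='species', right_index=True,
                  suffixes=('', '_max'), validate='many_to_one')

# Keep rows at or below their species cutoff, with the original columns only
out = merged.loc[merged['height'] <= merged['height_max'], df.columns]
```

Because the merge is an inner join, a species missing from TooBig is dropped from the result, the same outcome as the NaN comparison in the map approach.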
