
I have a data frame that is grouped by multiple columns; in this example it is grouped only by Year.

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse
2  2002    Lion
3  2002    Duck
4  2010     Dog         Cat
5  2010     Cat
6  2010    Lion
7  2010   Mouse

For each group, I would like to keep the rows where Animal2 is not empty and, among the rows where Animal2 is empty, keep only those whose Animal1 value appears in that group's Animal2 values.

The expected output would be:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse
2  2002    Lion
3  2010     Dog         Cat
4  2010     Cat

Rows 0 & 3 stayed since Animal2 is not empty.

Rows 1 & 2 stayed since Mouse & Lion are in Animal2 for the first group.

Row 4 stayed since Cat appears in Animal2 for the second group.
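
For anyone who wants to reproduce this, the input could be built roughly as follows. This is only a sketch: representing the empty Animal2 cells as None is an assumption (they might equally be empty strings in the real data), and `expected_index` is a hypothetical helper name listing the original row labels of the rows that should survive.

```python
import pandas as pd

# Example input; empty Animal2 cells are ASSUMED to be None (missing values)
df = pd.DataFrame({
    'Year': [2002] * 4 + [2010] * 4,
    'Animal1': ['Dog', 'Mouse', 'Lion', 'Duck', 'Dog', 'Cat', 'Lion', 'Mouse'],
    'Animal2': ['Mouse,Lion', None, None, None, 'Cat', None, None, None],
})

# Original row labels of the rows shown in the expected output
# (the expected output above is displayed with a reset index)
expected_index = [0, 1, 2, 4, 5]

print(df)
```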

EDIT: I get an error for a similar input data frame

   Year Animal1 Animal2
0  2002     Dog   Mouse
1  2002   Mouse
2  2002    Lion
3  2010     Dog
4  2010     Cat

The expected output would be:

   Year Animal1 Animal2
0  2002     Dog   Mouse
1  2002   Mouse

The error is triggered in the .apply(lambda g: g.isin(sets[g.name])) part of the code.

if not any(isinstance(k, slice) for k in key):
    if len(key) == self.nlevels and self.is_unique:
        # Complete key in unique index -> standard get_loc
        try:
            return (self._engine.get_loc(key), None)
        except KeyError as err:
            raise KeyError(key) from err

KeyError: (2010, 'Dog')

           
2 Answers

You can use masks and regexes:

# non empty Animal2
m1 = df['Animal2'].notna()

# make patterns with those Animals2 per Year
patterns = df[m1].groupby('Year')['Animal2'].agg('|'.join).str.replace(',', '|')

# for each Year select with the matching regex
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.str.fullmatch(patterns[g.name]))
     )

out = df.loc[m1|m2]
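
To make the intermediate steps concrete, here is a small sketch of what `patterns` holds for the example data and how `str.fullmatch` uses it for one group. The data frame is rebuilt from the question with empty cells assumed to be None, and `m2002` is a name introduced only for this illustration.

```python
import pandas as pd

# Example data from the question; empty Animal2 cells ASSUMED to be None
df = pd.DataFrame({
    'Year': [2002] * 4 + [2010] * 4,
    'Animal1': ['Dog', 'Mouse', 'Lion', 'Duck', 'Dog', 'Cat', 'Lion', 'Mouse'],
    'Animal2': ['Mouse,Lion', None, None, None, 'Cat', None, None, None],
})

m1 = df['Animal2'].notna()

# One alternation regex per Year, built from the non-empty Animal2 values
patterns = df[m1].groupby('Year')['Animal2'].agg('|'.join).str.replace(',', '|')
print(patterns.to_dict())  # {2002: 'Mouse|Lion', 2010: 'Cat'}

# Illustration for a single group: which 2002 rows fully match 'Mouse|Lion'?
m2002 = df.loc[df['Year'].eq(2002), 'Animal1'].str.fullmatch(patterns[2002])
print(m2002.tolist())  # [False, True, True, False]
```

Because the joined string is interpreted as a regular expression, names containing regex metacharacters (for example `.` or `+`) would need escaping, for instance with `re.escape`, before being joined.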

Or sets:

m1 = df['Animal2'].notna()

sets = (df.loc[m1, 'Animal2'].str.split(',')
          .groupby(df['Year'])
          .agg(lambda x: set().union(*x))
       )

m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.isin(sets[g.name]))
     )

out = df.loc[m1|m2]

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None
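
Regarding the EDIT in the question: a group with no non-empty Animal2 value has no entry in `patterns` or `sets`, so the lookup by `g.name` raises a KeyError. One possible workaround, sketched here and not part of the code above, replaces the per-group lookup with a left merge. The `keys` list and the `allowed` name are assumptions introduced for illustration; `keys` could be extended with the other grouping columns.

```python
import pandas as pd

# EDIT input from the question; empty Animal2 cells ASSUMED to be None
df = pd.DataFrame({
    'Year': [2002, 2002, 2002, 2010, 2010],
    'Animal1': ['Dog', 'Mouse', 'Lion', 'Dog', 'Cat'],
    'Animal2': ['Mouse', None, None, None, None],
})

keys = ['Year']  # hypothetical: add the other grouping columns here

m1 = df['Animal2'].notna()

# One row per (group keys, animal named in Animal2)
allowed = (df.loc[m1, keys + ['Animal2']]
             .assign(Animal1=lambda d: d['Animal2'].str.split(','))
             .drop(columns='Animal2')
             .explode('Animal1')
             .astype({'Animal1': df['Animal1'].dtype})  # keep merge dtypes aligned
             .drop_duplicates())

# Left merge keeps df's row order; groups without any Animal2 find no match
m2 = (df[keys + ['Animal1']]
        .merge(allowed, on=keys + ['Animal1'], how='left', indicator=True)['_merge']
        .eq('both')
        .to_numpy())

out = df.loc[m1 | m2]
print(out)
```

On the EDIT input this keeps only rows 0 and 1 (Dog and Mouse for 2002), which matches the expected output given in the question.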

5 Comments

Hi @mozway, thank you for your answer :) it's working. Can you please explain to me what this part is doing: .agg('|'.join).str.replace(',', '|')
This joins the strings per group and replaces the commas with | to craft a regex (Mouse|Lion, for example), which is then used to match the names.
Hi @mozway, I get an error if there is a row that does not belong to any group. Could you please help me fix it? I will add an edit to the description.
@thephoenix yes, please add an edit
Hi @mozway, sorry for the late reply. I have added the edit. Could you please take a look and let me know? Your help is much appreciated :)

Here is a solution using a list comprehension:

(df.loc[
    [a1 in a2 for a1,a2 in zip(df['Animal1'],df['Year'].map(df['Animal2'].str.split(',').groupby(df['Year']).sum()))] | 
    df['Animal2'].notna()]
    )

or

d = df['Animal2'].str.split(',').groupby(df['Year']).sum()

(df.loc[df.groupby('Year')['Animal1'].transform(lambda x: x.isin(d.loc[x.name])) | 
df['Animal2'].notna()]
)

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None

1 Comment

The second solution is not really working for me. I get an error for d when I try to group by three columns instead of groupby(df["Year"]). Am I missing something here? @rhug123
