Filter rows from a grouped data frame based on string columns

Question

I have a data frame that is grouped by multiple columns; in this example it is grouped only by Year.

   Year Animal1  Animal2
0  2002    Dog   Mouse,Lion
1  2002  Mouse            
2  2002   Lion            
3  2002   Duck            
4  2010    Dog   Cat
5  2010    Cat            
6  2010   Lion            
7  2010  Mouse
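For anyone who wants to reproduce this, the frame above can be built roughly as follows. This is a sketch that assumes the empty Animal2 cells are missing values (None) rather than empty strings, which is consistent with the None shown in the answers' output.

```python
import pandas as pd

# Sample data from the question; empty Animal2 cells are assumed to be None
df = pd.DataFrame({
    'Year': [2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010],
    'Animal1': ['Dog', 'Mouse', 'Lion', 'Duck', 'Dog', 'Cat', 'Lion', 'Mouse'],
    'Animal2': ['Mouse,Lion', None, None, None, 'Cat', None, None, None],
})

print(df)
```

If the real data holds empty strings instead, replacing them with None first would be needed before using notna().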

For each group, I would like to keep every row where Animal2 is not empty and, among the rows where Animal2 is empty, filter out those whose Animal1 does not appear in any Animal2 value of the same group.

The expected output would be:

  Year Animal1   Animal2
0  2002    Dog   Mouse,Lion
1  2002  Mouse            
2  2002   Lion                   
3  2010    Dog   Cat
4  2010    Cat

Rows 0 & 3 stayed since Animal2 is not empty.

Rows 1 & 2 stayed since Mouse & Lion are in Animal2 for the first group.

Row 4 stayed since Cat appears in Animal2 for the second group.

EDIT: I get an error for a similar input data frame:

  Year Animal1   Animal2
0  2002    Dog   Mouse
1  2002  Mouse            
2  2002   Lion                   
3  2010    Dog   
4  2010    Cat

The expected output would be:

  Year Animal1   Animal2
0  2002    Dog   Mouse
1  2002  Mouse

The error is triggered in the .apply(lambda g: g.isin(sets[g.name])) part of the code.

            if not any(isinstance(k, slice) for k in key):
                if len(key) == self.nlevels and self.is_unique:
                    # Complete key in unique index -> standard get_loc
                    try:
                        return (self._engine.get_loc(key), None)
                    except KeyError as err:
                        raise KeyError(key) from err

KeyError: (2010, 'Dog')

mozway · Accepted Answer · 2023-01-11 17:54:43Z

2

You can use masks and regexes:

# non empty Animal2
m1 = df['Animal2'].notna()

# build one regex pattern per Year from the non-empty Animal2 values
patterns = df[m1].groupby('Year')['Animal2'].agg('|'.join).str.replace(',', '|')

# for each Year, flag the Animal1 values that fully match that Year's pattern
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.str.fullmatch(patterns[g.name]))
     )

out = df.loc[m1|m2]
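To make the pattern-building step concrete, here is a small sketch for a single Year, using the first group's values from the question. The names animal1, animal2, pattern, matches and safe are introduced here for illustration. The last line is an optional precaution that is not part of the code above: if a name could contain regex metacharacters, escaping each name with re.escape before joining should keep the match literal.

```python
import re
import pandas as pd

# Non-empty Animal2 values for one Year (first group in the question)
animal2 = pd.Series(['Mouse,Lion'])

# Join the rows with '|' and turn commas into '|' to get a regex alternation
pattern = '|'.join(animal2).replace(',', '|')
print(pattern)  # Mouse|Lion

# fullmatch requires the whole string to equal one of the alternatives
animal1 = pd.Series(['Dog', 'Mouse', 'Lion', 'Duck'])
matches = animal1.str.fullmatch(pattern).tolist()
print(matches)  # [False, True, True, False]

# Optional precaution (an addition): escape each name before joining
safe = '|'.join(re.escape(name) for cell in animal2 for name in cell.split(','))
```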

Or sets:

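# non-empty Animal2
m1 = df['Animal2'].notna()

# one set per Year holding every name listed in Animal2
sets = (df.loc[m1, 'Animal2'].str.split(',')
          .groupby(df['Year'])
          .agg(lambda x: set().union(*x))
       )

# for each Year, flag the Animal1 values found in that Year's set
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.isin(sets[g.name]))
     )

# keep rows with a non-empty Animal2 or a matching Animal1
out = df.loc[m1|m2]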

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None
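With the input from the question's EDIT, the year 2010 has no non-empty Animal2, so it has no entry in sets and sets[g.name] raises a KeyError. One possible way around this, sketched below as a variation on the sets approach rather than the code above, is to collect the names in a plain dict and look them up with an empty set as the default. The name allowed is introduced here, and empty cells are again assumed to be None.

```python
import pandas as pd

# Input from the question's EDIT; empty Animal2 cells are assumed to be None
df = pd.DataFrame({
    'Year': [2002, 2002, 2002, 2010, 2010],
    'Animal1': ['Dog', 'Mouse', 'Lion', 'Dog', 'Cat'],
    'Animal2': ['Mouse', None, None, None, None],
})

# non-empty Animal2
m1 = df['Animal2'].notna()

# allowed maps each Year to the set of names listed in its Animal2 cells
allowed = {}
for year, names in zip(df.loc[m1, 'Year'], df.loc[m1, 'Animal2'].str.split(',')):
    allowed.setdefault(year, set()).update(names)

# Years without any Animal2 fall back to an empty set, so nothing matches
m2 = pd.Series(
    [animal in allowed.get(year, set())
     for year, animal in zip(df['Year'], df['Animal1'])],
    index=df.index,
)

out = df.loc[m1 | m2]
print(out)
```

On this input the sketch keeps rows 0 and 1 (Dog and Mouse for 2002), which matches the expected output given in the EDIT.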

edited Jan 11, 2023 at 17:54

answered Jan 11, 2023 at 17:10

mozway


5 Comments

the phoenix Over a year ago

Hi @mozway, thank you for your answer :) it's working. Can you please explain to me what this part is doing: .agg('|'.join).str.replace(',', '|')

mozway Over a year ago

This is to join the strings per group and to replace the commas by | to craft a regex (Mouse|Lion for example), which will be used to match the names

the phoenix Over a year ago

Hi @mozway, I get an error if there's a row which does not belong to any group. Could you please help me fix it? I will add an edit to the description.

mozway Over a year ago

@thephoenix yes, please add an edit

the phoenix Over a year ago

Hi @mozway, sorry for the late reply. I included the edit. Could you please take a look and let me know? Your help is much appreciated :)

rhug123 · Accepted Answer · 2023-01-11 18:56:23Z

1

Here is a solution using a list comprehension:

# summing the split lists concatenates them into one list of names per Year
names = df['Animal2'].str.split(',').groupby(df['Year']).sum()

(df.loc[
    [a1 in a2 for a1, a2 in zip(df['Animal1'], df['Year'].map(names))] |
    df['Animal2'].notna()]
)

or

d = df['Animal2'].str.split(',').groupby(df['Year']).sum()

(df.loc[
    df.groupby('Year')['Animal1'].transform(lambda x: x.isin(d.loc[x.name])) |
    df['Animal2'].notna()]
)

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None

answered Jan 11, 2023 at 18:56

rhug123


1 Comment

the phoenix Over a year ago

the second solution is not really working for me. I get an error for d when I try to groupby three columns instead of groupby(df["Year"]). Am I missing something here ? @rhug123
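For grouping by several columns, one option that avoids per-group lookups altogether is to explode Animal2 into one row per name and left-merge it back on the group keys plus the name. The sketch below is an alternative technique, not a fix of the code above. The Region column and the names keys, allowed and merged are hypothetical and only serve the illustration; empty cells are assumed to be None.

```python
import pandas as pd

# Hypothetical data: 'Region' is an invented second grouping column
df = pd.DataFrame({
    'Year':    [2002, 2002, 2002, 2002, 2010, 2010],
    'Region':  ['A', 'A', 'B', 'B', 'A', 'A'],
    'Animal1': ['Dog', 'Mouse', 'Mouse', 'Lion', 'Dog', 'Cat'],
    'Animal2': ['Mouse,Lion', None, None, None, None, None],
})
keys = ['Year', 'Region']

# non-empty Animal2
m1 = df['Animal2'].notna()

# one row per (group keys, allowed name), renamed so it lines up with Animal1
allowed = (df.loc[m1, keys + ['Animal2']]
             .assign(Animal2=lambda d: d['Animal2'].str.split(','))
             .explode('Animal2')
             .drop_duplicates()
             .rename(columns={'Animal2': 'Animal1'}))

# a left merge keeps the row order of df; 'both' marks an allowed Animal1
merged = df[keys + ['Animal1']].merge(
    allowed, on=keys + ['Animal1'], how='left', indicator=True)
m2 = (merged['_merge'] == 'both').to_numpy()

out = df.loc[m1 | m2]
print(out)
```

Groups without any non-empty Animal2 simply produce no rows in allowed, so they cannot raise a KeyError. Here Mouse in region B is dropped because Mouse is only listed for region A.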
