How to filter a pandas dataframe by string values and matching integers in rows?

Question

Thank you for your help. I am still relatively new to pandas and could not find this specific kind of query in search results.

I have a pandas dataframe:

+-----+---------+----------+
| id  |  value  | match_id |
+-----+---------+----------+
| A10 | grass   |        1 |
| B45 | cow     |        3 |
| B98 | bird    |        6 |
| B17 | grass   |        1 |
| A20 | tree    |        2 |
| A87 | farmer  |        5 |
| B11 | grass   |        1 |
| A33 | chicken |        4 |
| B56 | tree    |        2 |
| A23 | farmer  |        5 |
| B65 | cow     |        3 |
+-----+---------+----------+

I need to filter this dataframe for rows that share a match_id value, with the condition that, among the rows sharing that match_id, the id column must contain both an A string and a B string.

This is the expected output:

+-----+-------+----------+
| id  | value | match_id |
+-----+-------+----------+
| A10 | grass |        1 |
| B17 | grass |        1 |
| A20 | tree  |        2 |
| B11 | grass |        1 |
| B56 | tree  |        2 |
+-----+-------+----------+

How can I do this in, say, a single line of pandas code? Reproducible program below:

import pandas as pd

data_example = {'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'], 
                'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass', 'chicken', 'tree', 'farmer', 'cow'], 
                'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3]}
df_example = pd.DataFrame(data=data_example)

data_expected = {'id': ['A10', 'B17', 'A20', 'B11', 'B56'], 
                'value': ['grass', 'grass', 'tree', 'grass', 'tree'], 
                'match_id': [1, 1, 2, 1, 2]}
df_expected = pd.DataFrame(data=data_expected)

Thank you!

This is an excellently posed question. Thanks for taking the time to pull together the data and examples in a runnable format. The second part is pretty easy, but the first is trickier. Stand by.
– Paul H, Commented May 1, 2020 at 18:15
Why does the row with B56,tree,2 get included in the final output? While the ID contains B, it doesn't also contain 2.
– Paul H, Commented May 1, 2020 at 18:22
@PaulH Thank you. I mean to filter by two conditions: 1) rows whose match_id value is shared with other rows, and 2) among the rows sharing a match_id, the id column contains both A and B strings. Is this helpful?
– rabbittas2739, Commented May 1, 2020 at 18:31
Not really. Are you saying that for each group defined by match_id, at least one id that starts with "A" and at least one other id that starts with "B" need to be present?
– Paul H, Commented May 1, 2020 at 19:37
For instance, the group defined by match_id == 3 only has values in the id column that start with "B", so that group is excluded?
– Paul H, Commented May 1, 2020 at 19:38

Ben.T · Accepted Answer · 2020-05-01 18:23:46Z

A single line seems hard, but you can use str.extract to pull the two strings you want from id, then group by match_id and use any to check whether at least one row per match_id has each string. Applying all along axis 1 then gives True for every match_id that has both strings. Finally, map the match_id column through the series just created to select only the rows whose match_id is True.

s = (df_example['id'].str.extract(r'(A)|(B)').notna()
                     .groupby(df_example['match_id']).any().all(axis=1))
df_expected = df_example.loc[df_example['match_id'].map(s), :]

print(df_expected)
    id  value  match_id
0  A10  grass         1
3  B17  grass         1
4  A20   tree         2
6  B11  grass         1
8  B56   tree         2
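
To make the chain above easier to follow, here is a sketch that splits it into named steps. The intermediate names (has_letter, per_group, keep, result) are illustrative and not part of the original answer; the logic is meant to mirror the same str.extract / groupby / any / all / map sequence.

```python
import pandas as pd

df_example = pd.DataFrame({
    'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'],
    'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass',
              'chicken', 'tree', 'farmer', 'cow'],
    'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3],
})

# Step 1: one boolean column per capture group (column 0 for "A", column 1 for "B").
has_letter = df_example['id'].str.extract(r'(A)|(B)').notna()

# Step 2: per match_id, is there any row with "A"? Any row with "B"?
per_group = has_letter.groupby(df_example['match_id']).any()

# Step 3: keep only the match_id values where both columns are True.
keep = per_group.all(axis=1)

# Step 4: map the per-group flag back onto the rows and filter.
result = df_example[df_example['match_id'].map(keep)]
print(result)
```

Printing per_group on its own is a quick way to see which match_id values fail the check and why.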

edited May 1, 2020 at 18:23

answered May 1, 2020 at 18:18


sammywemmy · Answer · 2020-05-02 00:39:45Z

A different take on @Ben.T's solution:

# create a helper column that combines the letters per group
res = (df_example
       # the id column starts with a letter
       .assign(letter=lambda x: x.id.str[0])
       .groupby('match_id')
       .letter.transform(','.join)
      )

# attach the helper column to a copy, leaving df_example untouched
df = df_example.assign(grp=res)
df

     id    value  match_id    grp
0   A10    grass         1  A,B,B
1   B45      cow         3    B,B
2   B98     bird         6      B
3   B17    grass         1  A,B,B
4   A20     tree         2    A,B
5   A87   farmer         5    A,A
6   B11    grass         1  A,B,B
7   A33  chicken         4      A
8   B56     tree         2    A,B
9   A23   farmer         5    A,A
10  B65      cow         3    B,B

# filter for groups whose joined letters contain 'A,B', and keep only relevant columns
df.loc[df.grp.str.contains('A,B'), "id":"match_id"]

    id  value  match_id
0  A10  grass         1
3  B17  grass         1
4  A20   tree         2
6  B11  grass         1
8  B56   tree         2

# or you could use a list comprehension that guarantees both A and B are present (not just A directly followed by B)

filtered = [("A" in ent) and ("B" in ent) for ent in df.grp.array]
df.loc[filtered, "id":"match_id"]

    id  value  match_id
0  A10  grass         1
3  B17  grass         1
4  A20   tree         2
6  B11  grass         1
8  B56   tree         2
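
As a possible variant that avoids the helper string column and does not depend on row order within a group, the same check can be sketched with two boolean masks and transform('any'). This is not from either answer; it assumes, as above, that every id starts with its letter, and the names letters, groups, has_both and result are illustrative.

```python
import pandas as pd

df_example = pd.DataFrame({
    'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'],
    'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass',
              'chicken', 'tree', 'farmer', 'cow'],
    'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3],
})

# First character of each id (assumption: the letter is always the first character).
letters = df_example['id'].str[0]
groups = df_example['match_id']

# For every row: does its match_id group contain an "A" id, and a "B" id?
has_both = (letters.eq('A').groupby(groups).transform('any')
            & letters.eq('B').groupby(groups).transform('any'))

result = df_example[has_both]
print(result)
```

Because each mask is reduced with any per group, a group ordered B then A is treated the same as one ordered A then B.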
