
Thank you for your help. I am still relatively new to pandas and could not find this specific kind of query in search results.

I have a pandas dataframe:

+-----+---------+----------+
| id  |  value  | match_id |
+-----+---------+----------+
| A10 | grass   |        1 |
| B45 | cow     |        3 |
| B98 | bird    |        6 |
| B17 | grass   |        1 |
| A20 | tree    |        2 |
| A87 | farmer  |        5 |
| B11 | grass   |        1 |
| A33 | chicken |        4 |
| B56 | tree    |        2 |
| A23 | farmer  |        5 |
| B65 | cow     |        3 |
+-----+---------+----------+

I need to filter this dataframe for rows that contain matching match_id values, with the condition that the id column must also contain both strings A and B.

This is the expected output:

+-----+-------+----------+
| id  | value | match_id |
+-----+-------+----------+
| A10 | grass |        1 |
| B17 | grass |        1 |
| A20 | tree  |        2 |
| B11 | grass |        1 |
| B56 | tree  |        2 |
+-----+-------+----------+

How can I do this in, say, a single line of pandas code? Reproducible program below:

import pandas as pd

data_example = {'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'], 
                'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass', 'chicken', 'tree', 'farmer', 'cow'], 
                'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3]}
df_example = pd.DataFrame(data=data_example)

data_expected = {'id': ['A10', 'B17', 'A20', 'B11', 'B56'], 
                'value': ['grass', 'grass', 'tree', 'grass', 'tree'], 
                'match_id': [1, 1, 2, 1, 2]}
df_expected = pd.DataFrame(data=data_expected)

Thank you!

  • This is an excellently posed question. Thanks for taking the time to pull together the data and examples in a runnable format. The second part is pretty easy, but the first is trickier. Stand by. Commented May 1, 2020 at 18:15
  • Why does the row with B56,tree,2 get included in the final output? While the ID contains B, it doesn't also contain 2 Commented May 1, 2020 at 18:22
  • @PaulH Thank you, I mean to filter by two conditions: 1.) by rows in match_id column that have matching integers, and 2.) by rows in id column that contain string values both A and B per rows of matching match_id rows. Is this helpful? Commented May 1, 2020 at 18:31
  • Not really. Are you saying that for each group defined by match_id, at least one id that starts with "A" and at least one other id that starts with "B" need to be present? Commented May 1, 2020 at 19:37
  • For instance, the group defined by match_id == 3 only has values in the id column that start with "B", so that group is excluded? Commented May 1, 2020 at 19:38

2 Answers


A single line seems hard, but you can use str.extract to pull the two strings you want out of id, then group by match_id and call any to check whether at least one row per match_id has each string. Calling all along axis 1 then gives True only for the match_id values that have both strings. Finally, map the match_id column onto the series just created and use the result to select only the rows whose match_id is True.

s = (df_example['id'].str.extract('(A)|(B)').notna()
                     .groupby(df_example['match_id']).any().all(axis=1))
df_expected = df_example.loc[df_example['match_id'].map(s), :]

print(df_expected)
    id  value  match_id
0  A10  grass         1
3  B17  grass         1
4  A20   tree         2
6  B11  grass         1
8  B56   tree         2
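If a single statement is the goal, groupby(...).filter may be worth a look. The following is only a sketch, and it assumes the letter of interest is always the first character of id (true for the sample data, but not stated in the question). It keeps a match_id group only when the set of first letters in that group includes both 'A' and 'B'.

```python
import pandas as pd

df_example = pd.DataFrame({
    'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'],
    'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass',
              'chicken', 'tree', 'farmer', 'cow'],
    'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3],
})

# Assumption: the letter to test is the first character of id.
# Keep a group only if {'A', 'B'} is a subset of its first letters.
result = df_example.groupby('match_id').filter(
    lambda g: {'A', 'B'} <= set(g['id'].str[0])
)
print(result)
```

On large frames a Python lambda per group can be slow, so this trades some speed for brevity.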


A different take on @Ben.T's solution:

# create a helper column that combines the letters per group
res = (df_example
       # the id column starts with a letter
       .assign(letter=lambda x: x.id.str[0])
       .groupby('match_id')
       .letter.transform(','.join)
      )

# work on a copy so the original frame is left untouched
df = df_example.copy()
df['grp'] = res
df

     id    value  match_id    grp
0   A10    grass         1  A,B,B
1   B45      cow         3    B,B
2   B98     bird         6      B
3   B17    grass         1  A,B,B
4   A20     tree         2    A,B
5   A87   farmer         5    A,A
6   B11    grass         1  A,B,B
7   A33  chicken         4      A
8   B56     tree         2    A,B
9   A23   farmer         5    A,A
10  B65      cow         3    B,B

#filter for grps that contain A and B, and keep only relevant columns
df.loc[df.grp.str.contains('A,B'), "id":"match_id"]

     id value   match_id
0   A10 grass   1
3   B17 grass   1
4   A20 tree    2
6   B11 grass   1
8   B56 tree    2

# or you could use a list comprehension that checks for both A and B in any order (not just A followed by B)

filtered = [("A" in ent) and ("B" in ent) for ent in df.grp.array]
df.loc[filtered, "id":"match_id"]

     id value   match_id
0   A10 grass   1
3   B17 grass   1
4   A20 tree    2
6   B11 grass   1
8   B56 tree    2
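The same check can also be done without a helper column, by building one boolean mask per letter and broadcasting it over each match_id group with transform('any'). This is a rough sketch rather than part of the answer above; it again assumes the letter is the first character of id, and the names first_letter, has_a and has_b are made up for illustration.

```python
import pandas as pd

df_example = pd.DataFrame({
    'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'],
    'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass',
              'chicken', 'tree', 'farmer', 'cow'],
    'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3],
})

# Assumption: the letter to test is the first character of id.
first_letter = df_example['id'].str[0]
groups = df_example['match_id']

# For every row: does its match_id group contain at least one 'A' / one 'B'?
has_a = first_letter.eq('A').groupby(groups).transform('any')
has_b = first_letter.eq('B').groupby(groups).transform('any')

# Keep rows whose group has both letters; the original row order is preserved.
result = df_example[has_a & has_b]
print(result)
```

Because the masks do not depend on the order of letters within a group, this avoids the 'A,B' ordering issue of str.contains.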
