Python pandas - select rows based on groupby

Question

I have a sample table like this:

Dataframe: df

Col1  Col2  Col3  Col4
A     1     10    i
A     1     11    k
A     1     12    a
A     2     10    w
A     2     11    e
B     1     15    s
B     1     16    d
B     2     21    w
B     2     25    e
B     2     36    q
C     1     23    a
C     1     24    b

I'm trying to get all rows of the (Col1, Col2) group that has the fewest records within each Col1 value, while skipping any Col1 value that has only one such group (in this example Col1 = 'C'). So, the output would be as follows:

A     2     10    w
A     2     11    e
B     1     15    s
B     1     16    d

since group (A,2) has 2 records compared to group (A,1) which has 3 records.

I tried to approach this issue from different angles but just can't seem to get the result that I need. I am able to find the groups that I need using a combination of groupby, filter and agg, but how do I now use this as a select filter on df? After spending a lot of time on this, I wasn't even sure that the approach was correct, as it looked overly complicated. I am sure that there is an elegant solution but I just can't see it. Any advice on how to approach this would be greatly appreciated.

I had this to get the groups for which I wanted the rows displayed:

    groups = df.groupby(["Col1", "Col2"])["Col2"].agg(no="count")
    filteredGroups = groups.groupby(level=0).filter(lambda group: group.size > 1)
    print(filteredGroups.groupby(level=0).agg('idxmin'))

The second line was to account for Col1 values that have only one group, as I don't want to consider those. Honestly, I tried so many variations and approaches, and none of them gave me the result that I wanted. I see that none of the answers are one-liners, so at least I don't feel like I was overthinking the problem.
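One way to turn the group labels found this way into a row filter on df is to compare each row's (Col1, Col2) pair against those labels with MultiIndex.isin. This is only a sketch under that assumption: the names sizes, multi, keys, row_keys and selected are illustrative and not from the original post, and idxmin keeps only the first group when sizes tie.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": list("AAAAABBBBBCC"),
    "Col2": [1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1],
    "Col3": [10, 11, 12, 10, 11, 15, 16, 21, 25, 36, 23, 24],
    "Col4": list("ikawesdweqab"),
})

# Number of rows in each (Col1, Col2) group, indexed by a MultiIndex.
sizes = df.groupby(["Col1", "Col2"]).size()

# Keep only Col1 values that have more than one (Col1, Col2) group.
multi = sizes.groupby(level="Col1").filter(lambda s: len(s) > 1)

# For each remaining Col1 value, the (Col1, Col2) label of its smallest group.
keys = multi.groupby(level="Col1").idxmin()

# Select the rows of df whose (Col1, Col2) pair is one of those labels.
row_keys = pd.MultiIndex.from_frame(df[["Col1", "Col2"]])
selected = df[row_keys.isin(keys.tolist())]
print(selected)
```

Here the selected rows are those of groups (A, 2) and (B, 1); the C rows drop out because C has a single group.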

Just added an important part of the requirement: I need to exclude any Col1 value that contains only one group, such as (C, 1).
– Ant Smith, Commented Mar 20, 2017 at 20:12

JohnE · Accepted Answer · 2017-03-20 20:42:02Z

4

df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")

df['rnk']     = df.groupby('Col1')['sz'].rank(method='min')
df['rnk_rev'] = df.groupby('Col1')['sz'].rank(method='min',ascending=False)

df.loc[ (df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0) ]

      Col1  Col2  Col3 Col4  sz  rnk  rnk_rev
3    A     2    10    w   2  1.0      4.0
4    A     2    11    e   2  1.0      4.0
5    B     1    15    s   2  1.0      4.0
6    B     1    16    d   2  1.0      4.0

Edit: changed "count" to "size" (as in @Marco Spinaci's answer) which doesn't matter in this example but might if there were missing values.
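To illustrate the difference mentioned in the edit note, here is a small sketch on a made-up frame (demo is hypothetical, not the question's data) with one missing value in Col3. "size" counts all rows in a group, while "count" counts only non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one group of three rows, one of which has a missing Col3.
demo = pd.DataFrame({
    "Col1": ["A", "A", "A"],
    "Col2": [1, 1, 1],
    "Col3": [10.0, np.nan, 12.0],
})

grouped = demo.groupby(["Col1", "Col2"])["Col3"]

# "size" counts every row in the group, including the row where Col3 is NaN.
sz = grouped.transform("size")

# "count" counts only the non-missing Col3 values in the group.
cnt = grouped.transform("count")

print(sz.tolist())
print(cnt.tolist())
```

With missing values present, the two would disagree about how many records a group has, which would change the ranking.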

And for clarity, here's what the df looks like before the final row selection.

   Col1  Col2  Col3 Col4  sz  rnk  rnk_rev
0     A     1    10    i   3  3.0      1.0
1     A     1    11    k   3  3.0      1.0
2     A     1    12    a   3  3.0      1.0
3     A     2    10    w   2  1.0      4.0
4     A     2    11    e   2  1.0      4.0
5     B     1    15    s   2  1.0      4.0
6     B     1    16    d   2  1.0      4.0
7     B     2    21    w   3  3.0      1.0
8     B     2    25    e   3  3.0      1.0
9     B     2    36    q   3  3.0      1.0
10    C     1    23    a   2  1.0      1.0
11    C     1    24    b   2  1.0      1.0

edited Mar 20, 2017 at 20:42

answered Mar 20, 2017 at 17:58

JohnE

3 Comments

Ant Smith Over a year ago

This works perfectly! Thank you and @Marco Spinaci for the original solution.

Ant Smith Over a year ago

Could this be modified to skip over Col1 values that only contain one group? I've added extra lines to my table to illustrate. Basically, all Col1 == 'C' lines should be ignored, as there is only one group with C in it (C, 1).

Ant Smith Over a year ago

Works perfectly! Thank you!

Marco Spinaci · Accepted Answer · 2017-03-20 17:44:22Z

2

Definitely not a nice answer, but it should work:

tmp = df.groupby(['Col1','Col2']).size()
df['occurrencies'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]][df.Col2[i]])
df['min_occurrencies'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]].min())
df[df.occurrencies == df.min_occurrencies]

But there must be a more clever way to use groupby than creating an auxiliary data frame...
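One possible way to avoid both the auxiliary object and the per-row lambdas is groupby transform, which broadcasts a group-level value back onto every row. This is a sketch, not part of the original answer: occ and min_occ are illustrative names, and like the code above it does not skip Col1 values that have only one group, so the C rows are kept.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": list("AAAAABBBBBCC"),
    "Col2": [1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1],
    "Col3": [10, 11, 12, 10, 11, 15, 16, 21, 25, 36, 23, 24],
    "Col4": list("ikawesdweqab"),
})

# Size of each row's (Col1, Col2) group, aligned with df's index.
occ = df.groupby(["Col1", "Col2"])["Col3"].transform("size")

# Smallest group size within each Col1 value, again aligned with df's index.
min_occ = occ.groupby(df["Col1"]).transform("min")

# Rows belonging to a smallest group of their Col1 value.
result = df[occ == min_occ]
print(result)
```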

answered Mar 20, 2017 at 17:44

Marco Spinaci

Woody Pride · Accepted Answer · 2017-03-20 19:51:26Z

1

The following solution is based on groupby.apply. Simpler approaches exist, such as building helper Series as in JohnE's answer, which I would say is superior.

The solution works by grouping the dataframe at the Col1 level and then passing a function to apply that further groups the data by Col2. Each sub_group is then assessed to yield the smallest group. Note that ties in size will be determined by whichever is evaluated first. This may not be desirable.

# Create the data
import pandas as pd
df = pd.DataFrame({
    "Col1" : ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "Col2" : [1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
    "Col3" : [10, 11, 12, 10, 11, 15, 16, 21, 25, 36],
    "Col4" : ["i", "k", "a", "w", "e", "s", "d", "w", "e", "q"]
})

# Group by Col1; each group is split again by Col2 inside transFunc
Grouped = df.groupby("Col1")

def transFunc(x):
    # smallest holds [Col2 label, row count] of the smallest sub-group seen so far
    smallest = [None, None]
    sub_groups = x.groupby("Col2")
    for group, data in sub_groups:
        # The strict "<" means the first sub-group evaluated wins a tie
        if smallest[1] is None or len(data) < smallest[1]:
            smallest[0] = group
            smallest[1] = len(data)
    return sub_groups.get_group(smallest[0])

Grouped.apply(transFunc).reset_index(drop = True)

Edit: assign the result so that it can be printed:

result = Grouped.apply(transFunc).reset_index(drop = True)
print(result)
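If keeping every sub-group that ties for the smallest size is preferable to keeping only the first one evaluated, a variant could compare each row's sub-group size with the minimum size within its Col1 value. The sketch below assumes that behaviour is wanted; keep_smallest is a hypothetical helper and the tied frame is made-up data chosen to force a tie.

```python
import pandas as pd

# Made-up data: within Col1 == "A", sub-groups 2 and 3 both have two rows.
tied = pd.DataFrame({
    "Col1": ["A"] * 7,
    "Col2": [1, 1, 1, 2, 2, 3, 3],
    "Col3": [10, 11, 12, 10, 11, 20, 21],
})

def keep_smallest(frame):
    # Size of each row's Col2 sub-group within this Col1 group.
    sizes = frame.groupby("Col2")["Col2"].transform("size")
    # Keep all rows whose sub-group size equals the minimum, so ties survive.
    return frame[sizes == sizes.min()]

# Apply the helper to each Col1 group and stitch the pieces back together.
parts = [keep_smallest(part) for _, part in tied.groupby("Col1")]
result = pd.concat(parts).reset_index(drop=True)
print(result)
```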

edited Mar 20, 2017 at 19:51

answered Mar 20, 2017 at 18:20

Woody Pride

4 Comments

Ant Smith Over a year ago

I just tried this code and added print Grouped.head() but got the full df printout back. Did I miss something? Thanks

Woody Pride Over a year ago

There is no change to the dataframe or to the Grouped object until you assign it. None of the changes are inplace. In other words you cannot just print Grouped.head as the result of the computation (the last line) has not been assigned to an object. No changes are made directly on Grouped object.

Ant Smith Over a year ago

Sorry about that. Indeed, this works perfectly. I tried to print the wrong thing. Thanks

Woody Pride Over a year ago

I edited the answer just to make clear that, in the event of a tie for the smallest group, the group that is processed first will be retained. I'm not sure if this is the behavior you want or not.

Lộc Đoàn · Accepted Answer · 2022-03-07 09:51:20Z

0

I would like to add a shorter yet readable version of JohnE's solution. It applies the same two rank conditions as a boolean mask, without the helper rank columns:

df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
by_col1 = df.groupby('Col1')['sz']
df[(by_col1.rank(method='min') == 1) & (by_col1.rank(method='min', ascending=False) != 1)]

answered Mar 7, 2022 at 9:51

Lộc Đoàn
