5

I have a sample table like this:

Dataframe: df

Col1  Col2  Col3  Col4
A     1     10    i
A     1     11    k
A     1     12    a
A     2     10    w
A     2     11    e
B     1     15    s
B     1     16    d
B     2     21    w
B     2     25    e
B     2     36    q
C     1     23    a
C     1     24    b

I'm trying to get all records/rows of the (Col1, Col2) group that has the smallest number of records within each Col1 value, while skipping any Col1 value that has only one group (in this example Col1 = 'C'). So, the output would be as follows:

A     2     10    w
A     2     11    e
B     1     15    s
B     1     16    d

since group (A,2) has 2 records compared to group (A,1) which has 3 records.

I tried to approach this issue from different angles but just can't seem to get the result that I need. I am able to find the groups that I need using a combination of groupby, filter and agg, but how do I now use this as a select filter on df? After spending a lot of time on this, I wasn't even sure that the approach was correct, as it looked overly complicated. I am sure that there is an elegant solution but I just can't see it. Any advice on how to approach this would be greatly appreciated.

I had this to get the groups for which I wanted the rows displayed:

    groups = df.groupby(["Col1", "Col2"])["Col2"].agg(no="count")
    filteredGroups = groups.groupby(level=0).filter(lambda group: group.size > 1)
    print(filteredGroups.groupby(level=0).agg("idxmin"))

The second line was to account for Col1 values that have only one group, as I don't want to consider those. Honestly, I tried so many variations and approaches, and none of them gave me the result that I wanted. I see that none of the answers are one-liners, so at least I don't feel like I was overthinking the problem.

1
  • Just added an important part of the requirement: I need to not display any Col1 values that only contain one group, such as (C, 1). Commented Mar 20, 2017 at 20:12

4 Answers

4
df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")

df['rnk']     = df.groupby('Col1')['sz'].rank(method='min')
df['rnk_rev'] = df.groupby('Col1')['sz'].rank(method='min',ascending=False)

df.loc[ (df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0) ]

  Col1  Col2  Col3 Col4  sz  rnk  rnk_rev
3    A     2    10    w   2  1.0      4.0
4    A     2    11    e   2  1.0      4.0
5    B     1    15    s   2  1.0      4.0
6    B     1    16    d   2  1.0      4.0

Edit: changed "count" to "size" (as in @Marco Spinaci's answer) which doesn't matter in this example but might if there were missing values.
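
To illustrate the difference mentioned in the edit, here is a minimal sketch on a hypothetical toy frame (not the question's data) in which one Col3 value is missing. The frame `toy` and the column name `cnt` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the question's data) with one missing Col3 value.
toy = pd.DataFrame({
    "Col1": ["A", "A", "A", "A", "A"],
    "Col2": [1, 1, 1, 2, 2],
    "Col3": [10, np.nan, 12, 10, 11],
})

grouped = toy.groupby(["Col1", "Col2"])["Col3"]

# "count" skips missing values; "size" counts every row in the group.
toy["cnt"] = grouped.transform("count")
toy["sz"] = grouped.transform("size")

print(toy)
```

With "count", group (A, 1) appears to have 2 records and ties with (A, 2), so the smallest group is no longer unique; with "size" it keeps its 3 rows.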

And for clarity, here's what the df looks like before filtering down to the selected rows.

   Col1  Col2  Col3 Col4  sz  rnk  rnk_rev
0     A     1    10    i   3  3.0      1.0
1     A     1    11    k   3  3.0      1.0
2     A     1    12    a   3  3.0      1.0
3     A     2    10    w   2  1.0      4.0
4     A     2    11    e   2  1.0      4.0
5     B     1    15    s   2  1.0      4.0
6     B     1    16    d   2  1.0      4.0
7     B     2    21    w   3  3.0      1.0
8     B     2    25    e   3  3.0      1.0
9     B     2    36    q   3  3.0      1.0
10    C     1    23    a   2  1.0      1.0
11    C     1    24    b   2  1.0      1.0

3 Comments

This works perfectly! Thank you and @Marco Spinaci for the original solution.
Could this be modified to skip over Col1 values that only contain one group? I've added extra lines to my table to illustrate. Basically, all col1=='C' lines should be ignored as there is only one group with C in it (C, 1).
Works perfectly! Thank you!
2

Definitely not a nice answer, but it should work:

tmp = df.groupby(['Col1','Col2']).size()
df['occurrences'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]][df.Col2[i]])
df['min_occurrences'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]].min())
df[df.occurrences == df.min_occurrences]

But there must be a more clever way to use groupby than creating an auxiliary data frame...
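
One possible way to avoid the auxiliary object, sketched here with transform, is to broadcast the group sizes straight back onto the rows. This is only a sketch: the names `occurrences`, `min_occurrences` and `smallest` are illustrative, the data is the subset of the question's table without the 'C' rows, and like the code above it does not skip Col1 values that have a single group.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "Col2": [1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
    "Col3": [10, 11, 12, 10, 11, 15, 16, 21, 25, 36],
    "Col4": ["i", "k", "a", "w", "e", "s", "d", "w", "e", "q"],
})

# Size of each (Col1, Col2) group, broadcast back onto every row.
occurrences = df.groupby(["Col1", "Col2"])["Col3"].transform("size")

# Smallest group size within each Col1 value, also broadcast per row.
min_occurrences = occurrences.groupby(df["Col1"]).transform("min")

# Keep only the rows that belong to a smallest group.
smallest = df[occurrences == min_occurrences]
print(smallest)
```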


1

The following solution is based on groupby.apply. Simpler methods are available, such as building helper Series as in JohnE's answer, which I would say is superior.

The solution works by grouping the dataframe at the Col1 level and then passing a function to apply that further groups the data by Col2. Each sub_group is then assessed to yield the smallest group. Note that ties in size will be determined by whichever is evaluated first. This may not be desirable.

# create data
import pandas as pd
df = pd.DataFrame({
    "Col1": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "Col2": [1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
    "Col3": [10, 11, 12, 10, 11, 15, 16, 21, 25, 36],
    "Col4": ["i", "k", "a", "w", "e", "s", "d", "w", "e", "q"],
})

Grouped = df.groupby("Col1")

def transFunc(x):
    smallest = [None, None]
    sub_groups = x.groupby("Col2")
    for group, data in sub_groups:
        if not smallest[1] or len(data) < smallest[1]:
            smallest[0] = group
            smallest[1] = len(data)
    return sub_groups.get_group(smallest[0])

Grouped.apply(transFunc).reset_index(drop = True)

Edit to assign the result

result = Grouped.apply(transFunc).reset_index(drop = True)
print(result)
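
If keeping only the first of several tied groups is not what you want, a variant along these lines could keep every tied group instead. It is a rough sketch on hypothetical data (the frame `tied` and the helper `keep_all_smallest` are made up for illustration), and it loops over the Col1 groups with pd.concat rather than using apply.

```python
import pandas as pd

# Hypothetical data in which both Col2 groups under "A" have two rows (a tie).
tied = pd.DataFrame({
    "Col1": ["A", "A", "A", "A", "B", "B", "B"],
    "Col2": [1, 1, 2, 2, 1, 2, 2],
    "Col3": [10, 11, 12, 13, 15, 21, 25],
})

def keep_all_smallest(x):
    # Row count of each Col2 sub-group, aligned to the rows of x.
    sizes = x.groupby("Col2")["Col3"].transform("size")
    # Keep every row whose sub-group size equals the minimum, so ties survive.
    return x[sizes == sizes.min()]

# Apply the helper to each Col1 group and stitch the pieces back together.
result = pd.concat(
    [keep_all_smallest(part) for _, part in tied.groupby("Col1")]
).reset_index(drop=True)
print(result)
```

Here both (A, 1) and (A, 2) are retained because they tie at two rows each, while only (B, 1) is kept for B.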

4 Comments

I just tried this code and added print Grouped.head() but got the full df printout back. Did I miss something? Thanks
There is no change to the dataframe or to the Grouped object until you assign it. None of the changes are inplace. In other words you cannot just print Grouped.head as the result of the computation (the last line) has not been assigned to an object. No changes are made directly on Grouped object.
Sorry about that. Indeed, this works perfectly. I tried to print the wrong thing. Thanks
I edited the answer just to make clear that, in the event of a tie for the smallest group, the group that is processed first will be retained. I'm not sure if this is the behavior you want or not.
0

I would like to add a shorter yet readable version of JohnE's solution that does not add the rank columns to df:

df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
rnk = df.groupby('Col1')['sz'].rank(method='min')
rnk_rev = df.groupby('Col1')['sz'].rank(method='min', ascending=False)
df[(rnk == 1) & (rnk_rev != 1)]
