Python Pandas: How to sample when grouped and N > group size?

Question

I'd like to sample from a grouped Pandas DataFrame where the group size is sometimes smaller than N, the number of rows I want per group. In the following example, how could I sample 3 rows when the group size is >= 3, and otherwise take all members of the group?

I am trying the following, but I get an error saying "Cannot take a larger sample than population when 'replace=False'".

 import pandas as pd

 df = pd.DataFrame({'some_key': [0,0,0,0,0,0,1,2,1,2],
                    'val':      [0,1,2,3,4,5,6,7,8,9]})

 gby = df.groupby(['some_key'])

 gby.apply(lambda x: x.sample(n=3)).reset_index(drop=True)
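
The error can be reproduced outside of groupby. The following is a minimal sketch, assuming a hypothetical two-row frame `small_group` that stands in for the groups with keys 1 and 2 above: `sample` defaults to `replace=False`, so asking for 3 rows out of 2 raises a `ValueError`. The `random_state` values are only there to make the run repeatable.

```python
import pandas as pd

# Hypothetical stand-in for one of the small groups (only 2 rows).
small_group = pd.DataFrame({"some_key": [1, 1], "val": [6, 8]})

try:
    small_group.sample(n=3)  # replace=False is the default
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Sampling with replacement avoids the error, but the same row can appear
# more than once, which is usually not what is wanted here.
with_replacement = small_group.sample(n=3, replace=True, random_state=0)
print(with_replacement)
```

So the fix is to cap the requested sample size at the group size instead of switching to sampling with replacement.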

Please don't edit a solution into your question, post it as an answer instead.
– mbrig, Commented Oct 25, 2017 at 19:49
Oops. Misinterpreted the "do you really want to do this?" from SO.
– avsmith, Commented Oct 25, 2017 at 19:55
@Liborio Crossed comments. Was responding to suggestion I should move my answer. You came with a very similar answer at same time as me... Thanks.
– avsmith, Commented Oct 25, 2017 at 20:01

00__00__00 · Accepted Answer · 2017-10-25 19:44:51Z

3

You could use a conditional expression in your lambda function, so that groups with fewer than 3 rows are returned unchanged:

 gby.apply(lambda x: x.sample(n=3) if x.shape[0]>=3 else x).reset_index(drop=True)

A conditional expression has the form

val_if_true if cond else val_if_false


1 Comment

Alaa M. Over a year ago

gby.apply(lambda x: x.sample(min(len(x), 3)))
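
The `min(len(x), 3)` idea from the comment above can be expanded into a self-contained sketch. This version is an assumption-laden variation rather than the commenter's exact code: it samples a single column per group with `group_keys=False` so the original row labels survive, then uses those labels to pull full rows back out of `df`. The names `n_max`, `picked`, and `sampled` and the fixed `random_state` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "some_key": [0, 0, 0, 0, 0, 0, 1, 2, 1, 2],
    "val":      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

n_max = 3  # upper bound on the number of rows drawn per group

# Cap n at the group size so sample() never asks for more rows than exist.
# group_keys=False keeps the original row labels as the result's index.
picked = df.groupby("some_key", group_keys=False)["val"].apply(
    lambda s: s.sample(n=min(len(s), n_max), random_state=0)
)

# Use the sampled row labels to select the full rows from the original frame.
sampled = df.loc[picked.index]
print(sampled.sort_index())
```

Selecting by label afterwards keeps both columns in the result without relying on how `apply` treats the grouping column, which has changed between pandas versions.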

avsmith · Accepted Answer · 2017-10-25 19:56:20Z

1

Answering my own question....

I came up with a solution that is a bit different from the one proposed by Wen.

import pandas as pd

def nsample(x,n):
    if len(x) <= n:
        return x
    else:
        return x.sample(n=n)

df = pd.DataFrame({'some_key':[0,0,0,0,0,0,1,2,1,2],
                   'val':      [0,1,2,3,4,5,6,7,8,9]})

gby = df.groupby(['some_key'])

n_max = 3 
gby.apply(lambda x: nsample(x, n_max)).reset_index(drop=True)

# Alternative with inline lambda
gby.apply(lambda x: x.sample(n=n_max) if len(x) > n_max else x).reset_index(drop=True)


BENY · Accepted Answer · 2017-10-25 19:15:20Z

0

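You can use head or tail to take up to 3 rows per group. Note that these return the first or last rows of each group, not a random sample: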

df.groupby(['some_key']).head(3)
Out[248]: 
   some_key  val
0         0    0
1         0    1
2         0    2
6         1    6
7         2    7
8         1    8
9         2    9
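
Since `head(3)` always returns the first three rows of each group, it is not a random draw. One possible way to keep the `head` idiom while still sampling, which is not part of the original answer, is to shuffle the whole frame first and then take the head of each group. A minimal sketch, with `shuffled` and `sampled` as illustrative names and a fixed `random_state` only for repeatability:

```python
import pandas as pd

df = pd.DataFrame({
    "some_key": [0, 0, 0, 0, 0, 0, 1, 2, 1, 2],
    "val":      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Shuffle all rows; frac=1 returns every row in random order.
shuffled = df.sample(frac=1, random_state=0)

# head(3) keeps at most 3 rows per group, so small groups are kept whole
# and no "larger sample than population" error can occur.
sampled = shuffled.groupby("some_key").head(3)
print(sampled.sort_index())
```

Because the shuffle is uniform, the first three rows of a large group after shuffling form a random sample without replacement from that group.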

EDIT

l = []
for _, df1 in df.groupby('some_key'):
    if len(df1) < 3:
        l.append(df1)
    else:
        l.append(df1.sample(3))

pd.concat(l, axis=0)

Out[401]: 
   some_key  val
1         0    1
3         0    3
4         0    4
6         1    6
8         1    8
7         2    7
9         2    9

edited Oct 25, 2017 at 19:15

answered Oct 25, 2017 at 15:50


2 Comments

avsmith Over a year ago

Thanks. However, I wasn't clear that I wanted to sample among those with group size > N.

BENY Over a year ago

@avsmith your solution is nice too, have a nice day
