
I'd like to sample from a grouped Pandas DataFrame where the group size is sometimes smaller than the sample size N. In the following example, how could I sample 3 rows when the group size is >= 3, and otherwise take all members of the group?

I am trying the following, but I get an error saying "Cannot take a larger sample than population when 'replace=False'".

 import pandas as pd

 df = pd.DataFrame({'some_key':[0,0,0,0,0,0,1,2,1,2],
               'val':      [0,1,2,3,4,5,6,7,8,9]})

 gby = df.groupby(['some_key'])

 gby.apply(lambda x: x.sample(n=3)).reset_index(drop=True)
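
For context, the error arises because some groups hold fewer rows than the requested sample size. A quick way to see which ones, as a minimal sketch (the names `sizes` and `too_small` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'some_key': [0, 0, 0, 0, 0, 0, 1, 2, 1, 2],
                   'val':      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Number of rows in each group.
sizes = df.groupby('some_key').size()
print(sizes)

# Groups with fewer than 3 rows cannot supply a sample of 3 without replacement.
too_small = sizes[sizes < 3]
print(too_small)
```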
  • Please don't edit a solution into your question, post it as an answer instead. Commented Oct 25, 2017 at 19:49
  • actually that was my answer Commented Oct 25, 2017 at 19:53
  • Oops. Misinterpreted the "do you really want to do this?" from SO. Commented Oct 25, 2017 at 19:55
  • sorry @avsmith what do you mean? Commented Oct 25, 2017 at 19:58
  • @Liborio Crossed comments. Was responding to suggestion I should move my answer. You came with a very similar answer at same time as me... Thanks. Commented Oct 25, 2017 at 20:01

3 Answers


You could do:

 gby.apply(lambda x: x.sample(n=3) if x.shape[0]>=3 else x).reset_index(drop=True)

You can use a conditional expression in your lambda function:

val_if_true if cond else val_if_false
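
As a rough, self-contained illustration of the same conditional expression, here it is written as an explicit loop over the groups so each branch is easy to see. The names `n`, `pieces` and `result` are illustrative, and `random_state=0` is set only to make the run repeatable:

```python
import pandas as pd

df = pd.DataFrame({'some_key': [0, 0, 0, 0, 0, 0, 1, 2, 1, 2],
                   'val':      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})

n = 3  # maximum number of rows to keep per group (illustrative name)

# For each group: sample n rows if the group is large enough, else keep it whole.
pieces = [
    group.sample(n=n, random_state=0) if len(group) >= n else group
    for _, group in df.groupby('some_key')
]

result = pd.concat(pieces).reset_index(drop=True)
print(result)
```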

1 Comment

gby.apply(lambda x: x.sample(min(len(x), 3)))

Answering my own question....

I came up with a solution that is a bit different from the one proposed by Wen.

import pandas as pd

def nsample(x,n):
    if len(x) <= n:
        return x
    else:
        return x.sample(n=n)

df = pd.DataFrame({'some_key':[0,0,0,0,0,0,1,2,1,2],
                   'val':      [0,1,2,3,4,5,6,7,8,9]})

gby = df.groupby(['some_key'])

n_max = 3 
gby.apply(lambda x: nsample(x, n_max)).reset_index(drop=True)

# Alternative with inline lambda
gby.apply(lambda x: x.sample(n=n_max) if len(x) > n_max else x).reset_index(drop=True)



By using head or tail (note that this takes the first or last 3 rows of each group rather than a random sample):

df.groupby(['some_key']).head(3)
Out[248]: 
   some_key  val
0         0    0
1         0    1
2         0    2
6         1    6
7         2    7
8         1    8
9         2    9
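
The head call above returns the first three rows of each group, not a random draw. One possible variation, which is an editor's sketch and not part of this answer, is to shuffle the whole frame first with `df.sample(frac=1)` and then take the head of each group. The names `shuffled` and `sampled` are illustrative, and `random_state=0` is set only for repeatability:

```python
import pandas as pd

df = pd.DataFrame({'some_key': [0, 0, 0, 0, 0, 0, 1, 2, 1, 2],
                   'val':      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})

# frac=1 returns every row, in random order.
shuffled = df.sample(frac=1, random_state=0)

# head(3) keeps at most 3 rows per group; small groups are kept whole.
sampled = shuffled.groupby('some_key').head(3)

# Optional: restore the original row order.
sampled = sampled.sort_index()
print(sampled)
```

Because head never asks for more rows than a group has, no size check is needed with this approach.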

EDIT

l = []
for _, df1 in df.groupby('some_key'):
    # Keep small groups whole; sample 3 rows from the others.
    if len(df1) < 3:
        l.append(df1)
    else:
        l.append(df1.sample(3))

pd.concat(l, axis=0)

Out[401]: 
   some_key  val
1         0    1
3         0    3
4         0    4
6         1    6
8         1    8
7         2    7
9         2    9

2 Comments

Thanks. However, I wasn't clear that I wanted to sample among those with group size > N.
@avsmith your solution is nice too, have a nice day
