
Consider the following dataset:

a        b
0        23
0        21
1        25
1        20
1        19
2        44
2        11

How can I find, for each cluster in column a, the percentage of values in column b that are greater than 20? My code gives me the same value for every group.

NN20 = [x for x in b if (x > 20)]                         # computed once, over the whole column
percent_20 = lambda x: float(len(NN20)) / float(len(b))   # never uses its argument x
pnn20 = data.groupby('a').apply(percent_20)               # so every group gets the same global ratio
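
For reference, a minimal setup reproducing the frame above (assuming pandas imported as pd; the answers below refer to it as df):

import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1, 1, 2, 2],
                   'b': [23, 21, 25, 20, 19, 44, 11]})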

3 Answers


IIUC:

In [179]: df.groupby('a')['b'].apply(lambda x: x.gt(20).mean())
Out[179]:
a
0    1.000000
1    0.333333
2    0.500000
Name: b, dtype: float64

or, using transform to broadcast the per-group value back onto every row:

In [183]: df.groupby('a')['b'].transform(lambda x: x.gt(20).mean())
Out[183]:
0    1.000000
1    1.000000
2    0.333333
3    0.333333
4    0.333333
5    0.500000
6    0.500000
Name: b, dtype: float64
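
Because transform keeps the original index, the result can be attached directly as a new column (a small sketch building on the code above):

df['pct_gt20'] = df.groupby('a')['b'].transform(lambda x: x.gt(20).mean())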

2 Comments

Why do you use mean instead of sum/len?
@AlterNative, it gives the same result, but it's nicer, shorter, and requires fewer operations... PS actually mean(lst) == sum(lst) / len(lst) ;-)

If you need something fast, np.bincount could be a good solution instead of a Pandas groupby.

np.bincount(df.loc[df.b > 20, 'a']) / np.bincount(df.a)

which returns

array([ 1.        ,  0.33333333,  0.5       ])
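
One caveat worth guarding against (an addition, not in the snippet above): np.bincount sizes its output by the largest value it sees, so if the highest-numbered group in a has no b > 20, the numerator array comes out shorter than the denominator. Passing minlength keeps the two aligned:

# pad the numerator so both arrays cover every group present in `a`
np.bincount(df.loc[df.b > 20, 'a'], minlength=df.a.max() + 1) / np.bincount(df.a)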

Or if you wanted to transform the output back to a series, you could subsequently use np.take.

pd.Series((np.bincount(df.loc[df.b > 20, 'a']) / np.bincount(df.a)).take(df.a))

# 0    1.000000
# 1    1.000000
# 2    0.333333
# 3    0.333333
# 4    0.333333
# 5    0.500000
# 6    0.500000
# dtype: float64

In either case, this seems to be quite fast.

Smaller case: provided dataset

groupby approach from MaxU

%timeit df.groupby('a')['b'].transform(lambda x: x.gt(20).mean())
2.51 ms ± 65.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.bincount approach

%timeit pd.Series((np.bincount(df.loc[df.b > 20, 'a']) / np.bincount(df.a)).take(df.a))
271 µs ± 5.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Larger case: generated dataset

df = pd.DataFrame({'a': np.random.randint(0, 10, 100000), 
                   'b': np.random.randint(0, 100, 100000)}).sort_values('a')

groupby approach from MaxU

%timeit df.groupby('a')['b'].transform(lambda x: x.gt(20).mean())
11.3 ms ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.bincount approach

%timeit pd.Series((np.bincount(df.loc[df.b > 20, 'a']) / np.bincount(df.a)).take(df.a))
1.56 ms ± 5.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
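
A note on the gap: much of the groupby cost here is the Python-level lambda invoked once per group. A vectorized variant (a sketch, not benchmarked here) computes the boolean mask once and lets pandas aggregate it natively:

df.b.gt(20).groupby(df.a).transform('mean')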



This is one way to do it (I added another group, a = 3, so the output also shows a 0% case):

import pandas as pd

data = pd.DataFrame({'a': [0, 0, 1, 1, 1, 2, 2, 3],
                     'b': [23, 21, 25, 20, 19, 44, 11, 15]})

# flag values above 20, then take each group's share of flagged rows
data['c'] = data['b'].apply(lambda x: int(x > 20))
shareOf20 = data.groupby('a')['c'].sum() / data.groupby('a')['c'].count()
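
For that frame, shareOf20 should come out as follows (values derived by hand from the table above, not re-run):

# a
# 0    1.000000
# 1    0.333333
# 2    0.500000
# 3    0.000000
# Name: c, dtype: float64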

