
Consider the following dataset:

a        b
0        23
0        21
1        25
1        20
1        19
2        44
2        11

How can I find, for each cluster in column a, the percentage of values in column b that are greater than 20? My code gives me the same value for every group.

NN20 = [x for x in b if (x > 20)]                         # computed once, over the whole column
percent_20 = lambda x: float(len(NN20)) / float(len(b))   # never uses its argument x
pnn20 = data.groupby('a').apply(percent_20)               # so every group gets the same global ratio
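
For reference, a minimal setup reproducing the frame above (assuming pandas imported as pd; the answers below refer to it as df):

import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1, 1, 2, 2],
                   'b': [23, 21, 25, 20, 19, 44, 11]})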

3 Answers


IIUC:

In [179]: df.groupby('a')['b'].apply(lambda x: x.gt(20).mean())
Out[179]:
a
0    1.000000
1    0.333333
2    0.500000
Name: b, dtype: float64

or, using transform to broadcast the per-group value back onto every row:

In [183]: df.groupby('a')['b'].transform(lambda x: x.gt(20).mean())
Out[183]:
0    1.000000
1    1.000000
2    0.333333
3    0.333333
4    0.333333
5    0.500000
6    0.500000
Name: b, dtype: float64
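
Because transform keeps the original index, the result can be attached directly as a new column (a small sketch building on the code above):

df['pct_gt20'] = df.groupby('a')['b'].transform(lambda x: x.gt(20).mean())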

2 Comments

Why do you use mean instead of sum/len?
@AlterNative, it gives the same result, but it's nicer, shorter, and requires fewer operations... PS actually mean(lst) == sum(lst) / len(lst) ;-)

If you need something fast, np.bincount could be a good solution instead of a Pandas groupby.

np.bincount(df.loc[df.b > 20, 'a']) / np.bincount(df.a)

which returns

array([ 1.        ,  0.33333333,  0.5       ])
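
One caveat worth guarding against (an addition, not in the snippet above): np.bincount sizes its output by the largest value it sees, so if the highest-numbered group in a has no b > 20, the numerator array comes out shorter than the denominator. Passing minlength keeps the two aligned:

# pad the numerator so both arrays cover every group present in `a`
np.bincount(df.loc[df.b > 20, 'a'], minlength=df.a.max() + 1) / np.bincount(df.a)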

Or if you wanted to transform the output back to a series, you could subsequently use np.take.

pd.Series((np.bincount(df.loc[df.b > 20, 'a']) / np.bincount(df.a)).take(df.a))

# 0    1.000000
# 1    1.000000
# 2    0.333333
# 3    0.333333
# 4    0.333333
# 5    0.500000
# 6    0.500000
# dtype: float64

In either case, this seems to be quite fast.

Smaller case: provided dataset

groupby approach from MaxU

%timeit df.groupby('a')['b'].transform(lambda x: x.gt(20).mean())
2.51 ms ± 65.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.bincount approach

%timeit pd.Series((np.bincount(df.loc[df.b > 20, 'a']) / np.bincount(df.a)).take(df.a))
271 µs ± 5.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Larger case: generated dataset

df = pd.DataFrame({'a': np.random.randint(0, 10, 100000), 
                   'b': np.random.randint(0, 100, 100000)}).sort_values('a')

groupby approach from MaxU

%timeit df.groupby('a')['b'].transform(lambda x: x.gt(20).mean())
11.3 ms ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.bincount approach

%timeit pd.Series((np.bincount(df.loc[df.b > 20, 'a']) / np.bincount(df.a)).take(df.a))
1.56 ms ± 5.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
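
A note on the gap: much of the groupby cost here is the Python-level lambda invoked once per group. A vectorized variant (a sketch, not benchmarked here) computes the boolean mask once and lets pandas aggregate it natively:

df.b.gt(20).groupby(df.a).transform('mean')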



This is one way to do it (I added another group, a = 3, so the output also shows a 0% case):

import pandas as pd

data = pd.DataFrame({'a': [0, 0, 1, 1, 1, 2, 2, 3],
                     'b': [23, 21, 25, 20, 19, 44, 11, 15]})

# flag values above 20, then take each group's share of flagged rows
data['c'] = data['b'].apply(lambda x: int(x > 20))
shareOf20 = data.groupby('a')['c'].sum() / data.groupby('a')['c'].count()
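
For that frame, shareOf20 should come out as follows (values derived by hand from the table above, not re-run):

# a
# 0    1.000000
# 1    0.333333
# 2    0.500000
# 3    0.000000
# Name: c, dtype: float64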

