Pandas: Filtering on describe output (count)

Question

I am trying to limit the output returned by the describe output to a subset of only those records with a count great than or equal to any given number.

My dataframe is a subset of a larger one, and is defined as:

 df = evaluations[['score','garden_id']]

When I run describe on this,

df.groupby('garden_id').describe()

I would like to filter the returned data to those records where count > a specified number.

My output looks like:

Out[39]: 
                 score
garden_id             
37254     count      6
          mean      20
          std        0
          min       20
          25%       20
          50%       20
          75%       20
          max       20
37273     count      1
          mean      17
          std      NaN
          min       17
          25%       17
          50%       17
          75%       17
          max       17
37284     count      1
          mean      19
          std      NaN
          min       19
          25%       19
          50%       19
          75%       19
          max       19
37288     count      1
          mean      11
          std      NaN
          min       11
          25%       11
          50%       11
               ...

I was going to try modifying something like: modify-output-from-python-pandas-describe, but I only get this:

Out[40]: 
Empty DataFrame
Columns: [score]
Index: []

when I run df.groupby('garden_id').describe().loc[['count']]

I did try newframe = df.groupby('garden_id').describe().count() > 3, but I get a mask showing which records are true and which false, so not really sure how to use this, either.

Is there a way I can filter out desired records directly using the describe method without having to deal with data masking, etc.?

@jezrael apparently understood what I wanted: I want to set a threshold for the counts in the describe output, and then find out which garden_ids match this criterion. — horcle_buzz
– horcle_buzz, Commented Jun 13, 2016 at 19:24
.@horcle. You are not limiting the output of .describe(). The output are the same. — Merlin
– Merlin, Commented Jun 13, 2016 at 19:31
And, I didn't want just the counts: I also wanted all the other summary statistics associated with each garden_id. — horcle_buzz
– horcle_buzz, Commented Jun 13, 2016 at 20:03
@Merlin, I edited the question to reflect what I really was looking to do. — horcle_buzz
– horcle_buzz, Commented Jun 15, 2016 at 0:54

jezrael · Accepted Answer · 2016-06-13 06:09:44Z

I think you can use slicers for selecting and then boolean indexing for finding index idx1, where mask is True:

import pandas as pd

df = pd.DataFrame({'score':[1,2,3,3,1,2],
                   'garden_id':[1,1,1,1,2,2]})

print (df)
   garden_id  score
0          1      1
1          1      2
2          1      3
3          1      3
4          2      1
5          2      2

newframe = df.groupby('garden_id').describe()
print (newframe)
                    score
garden_id                
1         count  4.000000
          mean   2.250000
          std    0.957427
          min    1.000000
          25%    1.750000
          50%    2.500000
          75%    3.000000
          max    3.000000
2         count  2.000000
          mean   1.500000
          std    0.707107
          min    1.000000
          25%    1.250000
          50%    1.500000
          75%    1.750000
          max    2.000000

idx = pd.IndexSlice
mask = newframe.loc[idx[:,'count'],:] > 3
print (mask)
                 score
garden_id             
1         count   True
2         count  False

idx1 = mask[mask.values].index.get_level_values('garden_id')
print (idx1)
Int64Index([1], dtype='int64', name='garden_id')

print (newframe.loc[idx[idx1,:],:])
                    score
garden_id                
1         count  4.000000
          mean   2.250000
          std    0.957427
          min    1.000000
          25%    1.750000
          50%    2.500000
          75%    3.000000
          max    3.000000

Merlin · Accepted Answer · 2016-06-13 04:15:43Z

0

If you looking for counts of score greater than 3 Try this:

df2 =  df.groupby('garden_id').count()
df2[df2['score']> 3]

answered Jun 13, 2016 at 4:15

Merlin

25.8k44 gold badges141 silver badges213 bronze badges

Collectives™ on Stack Overflow

Pandas: Filtering on describe output (count)

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related