specifying "skip NA" when calculating mean of the column in a data frame created by Pandas

Question

I am learning the Pandas package by replicating the output from some of the R vignettes. Right now I am using the dplyr package from R as an example:

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

R script

planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

Python script

planes = hflights.groupby('TailNum')
planes['Distance'].agg({'count' : 'count',
                        'dist' : 'mean'})

How can I state explicitly in Python that NA needs to be skipped?

FooBar · Accepted Answer · 2018-07-12 09:25:17Z

32

That's a trick question, since you don't need to. Pandas automatically excludes NaN values from aggregation functions. Consider my df:

    b   c   d  e
a               
2   2   6   1  3
2   4   8 NaN  7
2   4   4   6  3
3   5 NaN   2  6
4 NaN NaN   4  1
5   6   2   1  8
7   3   2   4  7
9   6   1 NaN  1
9 NaN NaN   9  3
9   3   4   6  1

The built-in count() function ignores NaN values, and so does mean(). The only case where the result is NaN is when every value in the group is NaN. Then we take the mean of an empty set, which turns out to be NaN:

In[335]: df.groupby('a').mean()
Out[333]: 
          b    c    d         e
a                              
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667
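To reproduce the table above, here is a minimal sketch that rebuilds the example frame by hand (the construction code is my own; only the values come from the printed df) and checks that both count() and mean() skip NaN within each group:

```python
import numpy as np
import pandas as pd

nan = np.nan

# Values copied from the example frame; "a" becomes the index, as in the printout.
df = pd.DataFrame({
    "a": [2, 2, 2, 3, 4, 5, 7, 9, 9, 9],
    "b": [2, 4, 4, 5, nan, 6, 3, 6, nan, 3],
    "c": [6, 8, 4, nan, nan, 2, 2, 1, nan, 4],
    "d": [1, nan, 6, 2, 4, 1, 4, nan, 9, 6],
    "e": [3, 7, 3, 6, 1, 8, 7, 1, 3, 1],
}).set_index("a")

# Grouping by the index level name works the same as grouping by a column.
means = df.groupby("a").mean()    # NaN entries are left out of each group's mean
counts = df.groupby("a").count()  # number of non-NaN entries per group

print(means)
print(counts)
```

Group 4 has a count of 0 in column b, which is why its mean comes out as NaN.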

Aggregate functions work in the same way:

In[340]: df.groupby('a')['b'].agg({'foo': np.mean})
Out[338]: 
        foo
a          
2  3.333333
3  5.000000
4       NaN
5  6.000000
7  3.000000
9  4.500000
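Note for readers on recent pandas releases: to my knowledge, passing a dict of new names to agg() on a single selected column is no longer accepted (pandas 1.0 and later raise a SpecificationError). Keyword-style named aggregation is one possible replacement. The sketch below applies it to the pipeline from the question, using a tiny made-up hflights frame; the data and the extra non_null column are assumptions for illustration only. It also shows that dplyr's n() counts every row, which corresponds to 'size' rather than 'count':

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hflights data set (not the real data).
hflights = pd.DataFrame({
    "TailNum": ["N1", "N1", "N1", "N2", "N2", "N3"],
    "Distance": [100.0, np.nan, 300.0, 500.0, 700.0, np.nan],
})

delay = hflights.groupby("TailNum")["Distance"].agg(
    count="size",      # rows per group, NaN included (like dplyr's n())
    non_null="count",  # non-NaN values per group
    dist="mean",       # mean with NaN skipped (like na.rm = TRUE)
)

# Equivalent of filter(delay, count > 20, dist < 2000), with a threshold
# scaled down to fit the toy data.
filtered = delay[(delay["count"] > 1) & (delay["dist"] < 2000)]

print(delay)
print(filtered)
```

Bracket access (delay["count"]) is used on purpose, because delay.count would refer to the DataFrame method rather than the column.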

Addendum: Note that the standard DataFrame.mean API lets you control whether NaN values are included through its skipna parameter; the default is to exclude them.
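A small sketch of that switch on a plain (ungrouped) frame, with made-up values:

```python
import numpy as np
import pandas as pd

# Illustrative data: column x contains one NaN, column y has none.
frame = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [4.0, 5.0, 6.0]})

skipped = frame.mean()               # default skipna=True: NaN is ignored
strict = frame.mean(skipna=False)    # NaN propagates into the result

print(skipped)
print(strict)
```

With the default, x averages to 1.5; with skipna=False, x becomes NaN while y stays at 5.0.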

edited Jul 12, 2018 at 9:25

answered Jul 30, 2014 at 14:49

FooBar

2 Comments

Dr_Zaszuś Over a year ago

Thanks, and how do I do the opposite: make pandas include NaN?

FooBar Over a year ago

@Dr_Zaszuś have a look at the last line, which links to the manual. It lists the option of including NaN. You can build on top of that as the other answer suggests.

Community · Accepted Answer · 2017-05-23 12:17:53Z

8

What FooBar said is true regarding the default behavior, but there is a very easy way to specify skipna. Here is an example that speaks for itself:

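def custom_mean(series):
    # Each group's column arrives as a Series; skipna=False lets NaN propagate.
    return series.mean(skipna=False)

# `group` is a GroupBy object, e.g. df.groupby("a")
group.agg({"your_col_name_to_be_aggregated": custom_mean})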

That's it! You can customize your own aggregation the way you want, and I'd expect this to be fairly efficient, but I did not dig into it.
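For a self-contained version of the same idea, here is a sketch on a tiny made-up frame (the column names a and d and the values are assumptions chosen for illustration). It compares the default group mean with the custom one:

```python
import numpy as np
import pandas as pd

# Illustrative data: group 2 contains a NaN in column d, group 3 does not.
df = pd.DataFrame({
    "a": [2, 2, 2, 3],
    "d": [1.0, np.nan, 6.0, 2.0],
})

def custom_mean(series):
    # Called once per group with that group's values as a Series.
    return series.mean(skipna=False)

group = df.groupby("a")

default_result = group.agg({"d": "mean"})      # NaN skipped
strict_result = group.agg({"d": custom_mean})  # NaN propagates

print(default_result)
print(strict_result)
```

Group 2 averages to 3.5 by default but becomes NaN with custom_mean, while group 3 is 2.0 in both cases. A lambda such as lambda s: s.mean(skipna=False) should behave the same way if you prefer not to name the helper.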

It was also discussed here, but I thought I'd help spread the good news! Answer was found in the official doc.

edited May 23, 2017 at 12:17

CommunityBot

answered Apr 12, 2017 at 17:10

c-a

2 Comments

c-a Over a year ago

@lokheart, this might interest you.

GitHunter0 Over a year ago

Why does np.mean not work?
