
As you can see in the code below, I calculate the variance of the data in the 'open' column two different ways. The only difference is that the first version calls .var() on the underlying NumPy array (obtained via .values) rather than on the pandas Series itself. Why does this lead to different variance calculations?

import pandas as pd

apple_prices = pd.read_csv('apple_prices.csv')

print(apple_prices['open'].values.var())
#prints 102.22564310059172

print(apple_prices['open'].var())
#prints 103.82291877403847
  • Pandas and numpy have different default values for degrees of freedom.

1 Answer


The reason for the difference is that pandas.Series.var has a default ddof (delta degrees of freedom) of 1, while numpy.ndarray.var has a default ddof of 0. Setting ddof explicitly produces the same result:

import pandas as pd
import numpy as np
np.random.seed(0)

x = pd.Series(np.random.rand(100))

print(x.var(ddof=1))
# 0.08395738934787107

print(x.values.var(ddof=1))
# 0.08395738934787107
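
It also works in the other direction: forcing pandas down to NumPy's default of ddof=0 makes the two calls agree as well (continuing the same snippet):

print(x.var(ddof=0))
print(x.values.var())
# both print the same value: the ddof=1 result scaled by (n - 1) / n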

See the documentation at:
pandas.Series.var
numpy.var
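
Under the hood, ddof only changes the denominator: the variance is the sum of squared deviations from the mean divided by n - ddof. A minimal sketch with a small made-up array makes the two defaults concrete:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = a.size
ss = ((a - a.mean()) ** 2).sum()  # mean is 3.0, so ss = 4 + 1 + 0 + 1 + 4 = 10.0

print(ss / n)             # 2.0 -- NumPy's default (ddof=0), the population variance
print(a.var())            # 2.0

print(ss / (n - 1))       # 2.5 -- pandas' default (ddof=1), the sample variance
print(np.var(a, ddof=1))  # 2.5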
