
As you can see in the code below, I calculate the variance of the data in the 'open' column two different ways. The only difference is that the first version calls .var() on the underlying NumPy array (obtained via .values) rather than on the pandas Series itself. Why does this lead to different variance calculations?

import pandas as pd

apple_prices = pd.read_csv('apple_prices.csv')

print(apple_prices['open'].values.var())
#prints 102.22564310059172

print(apple_prices['open'].var())
#prints 103.82291877403847
  • Pandas and numpy have different default values for degrees of freedom.

1 Answer


The reason for the difference is that pandas.Series.var has a default ddof (delta degrees of freedom) of 1, while numpy.ndarray.var has a default ddof of 0. Setting ddof explicitly produces the same result:

import pandas as pd
import numpy as np
np.random.seed(0)

x = pd.Series(np.random.rand(100))

print(x.var(ddof=1))
# 0.08395738934787107

print(x.values.var(ddof=1))
# 0.08395738934787107
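
It also works in the other direction: forcing pandas down to NumPy's default of ddof=0 makes the two calls agree as well (continuing the same snippet):

print(x.var(ddof=0))
print(x.values.var())
# both print the same value: the ddof=1 result scaled by (n - 1) / n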

See the documentation at:
pandas.Series.var
numpy.var
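
Under the hood, ddof only changes the denominator: the variance is the sum of squared deviations from the mean divided by n - ddof. A minimal sketch with a small made-up array makes the two defaults concrete:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = a.size
ss = ((a - a.mean()) ** 2).sum()  # mean is 3.0, so ss = 4 + 1 + 0 + 1 + 4 = 10.0

print(ss / n)             # 2.0 -- NumPy's default (ddof=0), the population variance
print(a.var())            # 2.0

print(ss / (n - 1))       # 2.5 -- pandas' default (ddof=1), the sample variance
print(np.var(a, ddof=1))  # 2.5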
