Different result for std between pandas and numpy

Question

I am trying to standardize each column: subtract the column mean from every element and divide by the column's standard deviation. I did it in two different ways (numeric_data1 and numeric_data2):

import numpy as np
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# Drop the non-numeric column; the keyword form works across pandas versions
numeric_data = data.drop(columns="color")
numeric_data1 = ((numeric_data - numeric_data.mean()) /
                 numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
                 np.std(numeric_data, axis=0))

type(numeric_data1)  # -> pandas.core.frame.DataFrame
type(numeric_data2)  # -> pandas.core.frame.DataFrame

Both are pandas dataframes and they should have the same result. However, I get different results:

numeric_data2 == numeric_data1  # -> False

I think the problem stems from how numpy and pandas handle numeric precision:

numeric_data.mean() == np.mean(numeric_data, axis=0)      # -> True
numeric_data.std(axis=0) == np.std(numeric_data, axis=0)  # -> False

For the mean, numpy and pandas gave me the same thing, but for the standard deviation I got slightly different results.

Is my assessment correct or am I making some blunder?

Possible duplicate of Calculate numpy.std of each pandas.DataFrame's column?
– Cristian Ciupitu, Commented Mar 23, 2018 at 22:00

Accepted Answer · piRSquared · last edited 2018-03-23 21:51:47Z by Cristian Ciupitu

10

When calculating the standard deviation, it matters whether you are estimating the standard deviation of an entire population from a smaller sample of it, or calculating the standard deviation of the entire population itself.

If your data is a smaller sample of a larger population, you need what is called the sample standard deviation. As it turns out, when you divide the sum of squared differences from the mean by the number of observations, you end up with a biased estimator of the population variance. We correct for that by dividing by one less than the number of observations. The ddof argument controls this: ddof=1 gives the sample standard deviation and ddof=0 gives the population standard deviation. pandas.DataFrame.std defaults to ddof=1 and numpy.std defaults to ddof=0, which is why your two results differ.

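Truth is, it doesn't matter much if your sample size is large, because the two results differ only by a factor of sqrt(n / (n - 1)). But you will see small differences, and they are enough to make an exact comparison fail.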
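To make the two divisors concrete, here is a minimal sketch on a small made-up list of numbers (the values are hypothetical and not taken from the wine dataset). It computes both versions by hand and compares them with the library defaults:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)
mean = sum(values) / n
sum_sq = sum((v - mean) ** 2 for v in values)  # sum of squared deviations

population_std = (sum_sq / n) ** 0.5         # divide by n      (ddof=0)
sample_std = (sum_sq / (n - 1)) ** 0.5       # divide by n - 1  (ddof=1)

numpy_default = np.std(values)               # NumPy default is ddof=0
pandas_default = pd.Series(values).std()     # pandas default is ddof=1

print(population_std, numpy_default)
print(sample_std, pandas_default)
```

The first printed pair should agree with each other, and so should the second; the two pairs differ by the factor sqrt(n / (n - 1)).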

Use the degrees of freedom argument in your pandas.DataFrame.std call:

import numpy as np
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# Drop the non-numeric column; the keyword form works across pandas versions
numeric_data = data.drop(columns="color")
numeric_data1 = ((numeric_data - numeric_data.mean()) /
                 numeric_data.std(ddof=0))  # <<< population std, matches np.std default
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
                 np.std(numeric_data, axis=0))

np.isclose(numeric_data1, numeric_data2).all()  # -> True

Or in the np.std call:

import numpy as np
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# Drop the non-numeric column; the keyword form works across pandas versions
numeric_data = data.drop(columns="color")
numeric_data1 = ((numeric_data - numeric_data.mean()) /
                 numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
                 np.std(numeric_data, axis=0, ddof=1))  # <<< sample std, matches pandas default
                 
np.isclose(numeric_data1, numeric_data2).all()  # -> True
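If you want to check this without downloading the CSV, the following sketch does the same comparison on a synthetic DataFrame. The column names, shape, and random seed are assumptions made up for the example, and the NumPy side works on a plain array via to_numpy() so that no pandas method is involved there:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns (hypothetical names and size).
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
arr = frame.to_numpy()

# NumPy standardization on the raw array (ddof=0 by default).
z_numpy = (arr - arr.mean(axis=0)) / arr.std(axis=0)

# pandas standardization with the default ddof=1 and with ddof=0.
z_pandas_default = (frame - frame.mean()) / frame.std()
z_pandas_ddof0 = (frame - frame.mean()) / frame.std(ddof=0)

default_matches = np.allclose(z_pandas_default.to_numpy(), z_numpy)
ddof0_matches = np.allclose(z_pandas_ddof0.to_numpy(), z_numpy)
print(default_matches, ddof0_matches)
```

With 100 rows the default pandas result is off by roughly half a percent, so only the ddof=0 version should match the NumPy result within np.allclose's default tolerance.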

edited Mar 23, 2018 at 21:51

Cristian Ciupitu

answered Sep 6, 2017 at 20:05

piRSquared

4 Comments

psimeson Over a year ago

Could you please explain it more?

Scott Boston Over a year ago

Look at the docs: numpy.std has a default of ddof=0, while pandas.DataFrame.std has a default of ddof=1.

psimeson Over a year ago

Thanks @ScottBoston, it makes sense. I had no idea about ddof.

piRSquared Over a year ago

There you go, more explanation.
