I am trying to subtract each column's mean from every element in that column and divide by the column's standard deviation (i.e., standardize the data). I did it in two different ways (numeric_data1 and numeric_data2):
import numpy as np
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", axis=1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
                 numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
                 np.std(numeric_data, axis=0))
type(numeric_data1) # -> pandas.core.frame.DataFrame
type(numeric_data2) # -> pandas.core.frame.DataFrame
Both are pandas DataFrames, so they should contain the same values. However, I get different results:
(numeric_data1 == numeric_data2).all().all() # -> False
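(As an aside: `==` on two DataFrames compares elementwise and returns a DataFrame of booleans, so a single True/False for the whole frame needs something like `.equals()` or `np.allclose`. A minimal sketch on a toy frame, with made-up values rather than the wine data:)

```python
import numpy as np
import pandas as pd

# toy frames (made-up values, not the wine dataset)
a = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
b = a + 1e-9  # nearly, but not exactly, equal

# elementwise comparison: returns a DataFrame of booleans, not a single bool
elementwise = (a == b)

# single True/False for the whole frame:
exact_match = a.equals(b)        # exact equality (also checks dtypes)
close_match = np.allclose(a, b)  # equality up to floating-point tolerance

print(exact_match)  # False
print(close_match)  # True
```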
I think the problem stems from how numpy and pandas handle numeric precision:
(numeric_data.mean() == np.mean(numeric_data, axis=0)).all() # -> True
(numeric_data.std(axis=0) == np.std(numeric_data, axis=0)).all() # -> False
For the mean, numpy and pandas gave me the same result, but for the standard deviation, the results differ slightly.
Is my assessment correct, or am I making some blunder?
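The discrepancy can be reproduced on a toy Series, independent of the wine data (the values here are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # toy data, not the wine dataset

pandas_std = s.std()               # pandas' default standard deviation
numpy_std = np.std(s.to_numpy())   # numpy's default standard deviation on the same values

print(pandas_std)  # ~1.2910
print(numpy_std)   # ~1.1180
print(np.isclose(pandas_std, numpy_std))  # False
```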