
I am performing some tests on the normality of the numpy random generator. Running the following code, stats.normaltest reports a low p-value (suggesting a non-normal distribution) for some seeds.

import numpy as np
from scipy import stats

for i in range(100):
    rng = np.random.default_rng(i)
    x = rng.standard_normal(100000)
    if stats.normaltest(x).pvalue < 0.05:
        print(i)

How can this be?


2 Answers


You run 100 tests at a 5 % significance level, so even with perfectly normal data about 5 of them will fail by chance. On top of that, with n = 100 000 the normality test has enormous power and will flag even tiny deviations. If you just want to stop seeing spurious failures, lower your sample size (e.g. n = 1000 instead of 100 000).
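A back-of-the-envelope sketch of that expected failure count (assuming independent tests, which holds here since each seed produces an independent sample):

```python
# With m independent tests at level alpha and all null hypotheses true:
m = 100
alpha = 0.05

expected_false_rejections = m * alpha       # 5.0 rejections on average
p_at_least_one = 1 - (1 - alpha) ** m       # ~0.994: almost guaranteed to see one

print(expected_false_rejections, round(p_at_least_one, 3))
```

So seeing a handful of "failing" seeds out of 100 is not just possible, it is all but certain.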


Your question highlights a statistical subtlety rather than a numpy issue.

As @Shadrack Sylvestar Mbwagha mentioned, in statistics this is known as the multiple comparisons problem: under the null hypothesis (in your case, the hypothesis that the data you are sampling actually follows a Gaussian distribution), the p-value follows a uniform distribution on [0, 1]. This means that if you choose alpha = 0.05, you will reject the null hypothesis with 5% probability even when it is true.

If you repeat your experiment with a large number of replicates, you can see this directly:

import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

# Collect the normality-test p-value for each of 10,000 seeds.
p_vals = []
for i in range(10_000):
    rng = np.random.default_rng(i)
    x = rng.standard_normal(100_000)
    p_vals.append(stats.normaltest(x).pvalue)
p_vals = np.array(p_vals)

# Histogram of the p-values: uniform under the null hypothesis.
plt.hist(p_vals, bins=np.linspace(0, 1, 21))
plt.xlabel("p-value")
plt.ylabel("frequency")
plt.show()

[Histogram: p-values approximately uniform on [0, 1]]

You see that the p-values indeed follow a uniform distribution on [0, 1], and (p_vals < 0.05).mean() (i.e. the proportion of significant p-values) is very close to 0.05.
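If the full run above is too slow, a smaller sketch (hypothetical parameters: 2,000 seeds and n = 1,000 per sample, so it finishes in seconds) shows the same calibration:

```python
import numpy as np
from scipy import stats

# Null p-values from the normality test on truly normal samples,
# using smaller parameters than above but showing the same behaviour.
p_vals = np.array([
    stats.normaltest(np.random.default_rng(seed).standard_normal(1_000)).pvalue
    for seed in range(2_000)
])

# The rejection rate at alpha = 0.05 should sit close to 5%.
print((p_vals < 0.05).mean())
```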

For this reason, in multiple testing you want to adjust your p-values to control the FDR (false discovery rate). If you adjust your p-values with the Benjamini-Hochberg method, you get the following output:

# BH-adjusted p-values (requires scipy >= 1.11).
q_vals = stats.false_discovery_control(p_vals, method="bh")
plt.hist(q_vals, bins=np.linspace(0, 1, 21))
plt.xlabel("adjusted p-value")
plt.ylabel("frequency")
plt.show()

[Histogram: BH-adjusted p-values concentrated near 1]

This highlights that, once you have adjusted your p-values for multiple testing, you can no longer reject the null hypothesis for any seed, which is consistent with the samples coming from a normal distribution.

Let me know if you have other doubts.
