I'm trying to compute the probability under a normal distribution for my data, which is in a DataFrame df in Python. I'm not experienced with Python or programming. The following user-defined function, which I found on this site, works, but the scipy function does not.

UDF:

def normal(x,mu,sigma):
    return ( 2.*np.pi*sigma**2. )**-.5 * np.exp( -.5 * (x-mu)**2. / sigma**2. )
df["normprob"] = normal(df["return"],df["meanreturn"],df["sdreturn"])

This scipy function does not work:

df["normdistprob"] = scip.norm.sf(df["return"],df["meanreturn"],df["sdreturn"])

and it returns the following error

C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1815: RuntimeWarning: invalid value encountered in true_divide
  x = np.asarray((x - loc)/scale, dtype=dtyp)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1816: RuntimeWarning: invalid value encountered in greater
  cond0 = self._argcheck(*args) & (scale > 0)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1817: RuntimeWarning: invalid value encountered in greater
  cond1 = self._open_support_mask(x) & (scale > 0)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1818: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)

Any advice is appreciated. Also to note, the first 20 cells of

df["meanreturn"]

are NA, not sure if that's affecting it.

  • yeah, having NA in any math calculation will make it crash Commented Feb 1, 2018 at 8:32
  • What is your intended way of calculating the probability if the mean is NA? Commented Feb 1, 2018 at 8:34
  • Okay, I thought even though it was the first 20 cells, that wouldn't affect the rest of the dataset, and the first 20 cells of 'df["normdist"]' would simply be NaN as well. Also, from this link stackoverflow.com/questions/25039328/…, it seems that the NaN cells wouldn't matter? Commented Feb 1, 2018 at 8:37

1 Answer

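I'm not sure the survival function is what you need. norm.sf returns the upper-tail probability P(X > x), whereas your custom function evaluates the normal probability density. I believe what you're looking for is scipy's pdf function, specifically the pdf for a normal random variable. I tested it against the custom function you used: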

>>> from scipy.stats import norm
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [0.6, 0.5, 0.13], 'mu': [0, 1, 1], 'std': [1, 2, 1]})
>>> norm.pdf(df['x'], df['mu'], df['std'])
array([ 0.3332246 ,  0.19333406,  0.27324443])
>>> def normal(x,mu,sigma):
...     return ( 2.*np.pi*sigma**2. )**-.5 * np.exp( -.5 * (x-mu)**2. / sigma**2. )
...
>>> normal(df['x'], df['mu'], df['std'])
0    0.333225
1    0.193334
2    0.273244
dtype: float64
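To make the difference between the density and the tail probabilities concrete, here is a minimal sketch using the same toy DataFrame as above. The variable names (density, lower_tail, upper_tail) are mine, chosen only for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

df = pd.DataFrame({'x': [0.6, 0.5, 0.13], 'mu': [0, 1, 1], 'std': [1, 2, 1]})

# Density at x: this is what the custom normal() function computes.
# It is a density value, not a probability.
density = norm.pdf(df['x'], loc=df['mu'], scale=df['std'])

# P(X <= x): cumulative distribution function.
lower_tail = norm.cdf(df['x'], loc=df['mu'], scale=df['std'])

# P(X > x): survival function, which equals 1 - cdf.
upper_tail = norm.sf(df['x'], loc=df['mu'], scale=df['std'])
```

If you actually want a probability (for example, the chance of a return larger than the observed one), sf or cdf is the right call; if you want to reproduce the custom function, use pdf.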

Note that if values in your mu and std columns are np.nan, you will get the runtime warnings, but you will still get an output, with NaN in the affected rows, just as with the custom function.

>>> df = pd.DataFrame({'x': [0.6, 0.5, 0.13], 'mu': [np.nan, 1, 1], 'std': [np.nan, 2, np.nan]})
>>> norm.pdf(df['x'], df['mu'], df['std'])
C:\Users\lyang3\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1650: RuntimeWarning: invalid value encountered in greater
  cond0 = self._argcheck(*args) & (scale > 0)
C:\Users\lyang3\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:876: RuntimeWarning: invalid value encountered in greater_equal
  return (self.a <= x) & (x <= self.b)
C:\Users\lyang3\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:876: RuntimeWarning: invalid value encountered in less_equal
  return (self.a <= x) & (x <= self.b)
C:\Users\lyang3\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1651: RuntimeWarning: invalid value encountered in greater
  cond1 = self._support_mask(x) & (scale > 0)
array([        nan,  0.19333406,         nan])
>>> normal(df['x'], df['mu'], df['std'])
0         NaN
1    0.193334
2         NaN
dtype: float64

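The warnings are harmless. Setting the np.nan values to None gives the same result, because pandas stores None in a numeric column as NaN. The warnings probably do not reappear here only because Python, by default, shows each warning once per code location: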

>>> df = pd.DataFrame({'x': [0.6, 0.5, 0.13], 'mu': [None, 1, 1], 'std': [None, 2, None]})
>>> normal(df['x'], df['mu'], df['std'])
0         NaN
1    0.193334
2         NaN
dtype: float64
>>> norm.pdf(df['x'], df['mu'], df['std'])
array([        nan,  0.19333406,         nan])
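If the warnings are distracting, one option is to silence NumPy's "invalid value" floating-point warnings just around the call. This is a minimal sketch using np.errstate; whether any warnings appear at all may depend on your scipy and NumPy versions:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

df = pd.DataFrame({'x': [0.6, 0.5, 0.13],
                   'mu': [np.nan, 1, 1],
                   'std': [np.nan, 2, np.nan]})

# Ignore "invalid value encountered in ..." warnings only inside this block;
# the previous error-handling settings are restored when the block exits.
with np.errstate(invalid='ignore'):
    density = norm.pdf(df['x'], df['mu'], df['std'])
```

Rows with a missing mean or standard deviation still come back as NaN; only the warning output changes.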

I would either remove the rows where your meanreturn and sdreturn values are NaN, or, if you are willing to assume a standard normal distribution for those rows, set the NaN values of meanreturn to 0 and the NaN values of sdreturn to 1.
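Both options can be sketched with pandas as follows. The column names come from the question, but the data values and the names complete and filled are hypothetical, for illustration only:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Hypothetical data using the column names from the question.
df = pd.DataFrame({
    'return': [0.6, 0.5, 0.13],
    'meanreturn': [np.nan, 1.0, 1.0],
    'sdreturn': [np.nan, 2.0, 1.0],
})

# Option 1: drop rows where either parameter is missing.
complete = df.dropna(subset=['meanreturn', 'sdreturn']).copy()
complete['normprob'] = norm.pdf(
    complete['return'], complete['meanreturn'], complete['sdreturn'])

# Option 2: assume a standard normal (mean 0, sd 1) where parameters are missing.
filled = df.fillna({'meanreturn': 0.0, 'sdreturn': 1.0})
filled['normprob'] = norm.pdf(
    filled['return'], filled['meanreturn'], filled['sdreturn'])
```

Option 2 changes the meaning of the result for the filled rows, so it is only appropriate if a standard normal is a sensible assumption for your data.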

One last comment: if every row of your dataframe assumes a standard normal distribution, then you don't need to pass the mu and std columns at all. norm.pdf defaults to a standard normal (loc=0, scale=1). In this case, you can just run your code like so:

>>> norm.pdf(df['x'])
array([ 0.3332246 ,  0.35206533,  0.39558542])
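As a quick check of the defaults, and of how the general case relates to the standard normal, here is a small sketch. The values of x, mu and sigma are arbitrary examples:

```python
import numpy as np
from scipy.stats import norm

x, mu, sigma = 0.5, 1.0, 2.0

# Calling pdf with no loc/scale is the same as loc=0, scale=1.
default = norm.pdf(x)
explicit = norm.pdf(x, loc=0, scale=1)

# A general normal density is the standard normal density evaluated at
# the z-score (x - mu) / sigma, divided by sigma.
scaled = norm.pdf(x, loc=mu, scale=sigma)
standardized = norm.pdf((x - mu) / sigma) / sigma
```

Here default equals explicit, and scaled equals standardized, which is the same relationship the custom function encodes.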