3

I have a Data Frame that contains two columns named, "thousands of dollars per year", and "EMPLOY".

I create a new variable in this data frame named "cubic_Root" by computing the data in df['thousands of dollars per year']

df['cubic_Root'] = -1 / df['thousands of dollars per year'] ** (1. / 3)

The data in df['cubic_Root'] like that:

ID cubic_Root

1 -0.629961

2 -0.405480

3 -0.329317

4 -0.480750

5 -0.305711

6 -0.449644

7 -0.449644

8 -0.480750

Now! How can I draw a normal probability plot by using the data in df['cubic_Root'].

1

1 Answer 1

6

You want the "Probability" Plots.

So for a single plot, you'd have something like below.

import scipy.stats
import numpy as np
import matplotlib.pyplot as plt

# 100 values from a normal distribution with a std of 3 and a mean of 0.5
data = 3.0 * np.random.randn(100) + 0.5

counts, start, dx, _ = scipy.stats.cumfreq(data, numbins=20)
x = np.arange(counts.size) * dx + start

plt.plot(x, counts, 'ro')
plt.xlabel('Value')
plt.ylabel('Cumulative Frequency')

plt.show()

enter image description here

If you want to plot a distribution, and you know it, define it as a function, and plot it as so:

import numpy as np
from matplotlib import pyplot as plt

def my_dist(x):
    return np.exp(-x ** 2)

x = np.arange(-100, 100)
p = my_dist(x)
plt.plot(x, p)
plt.show()

If you don't have the exact distribution as an analytical function, perhaps you can generate a large sample, take a histogram and somehow smooth the data:

import numpy as np
from scipy.interpolate import UnivariateSpline
from matplotlib import pyplot as plt

N = 1000
n = N/10
s = np.random.normal(size=N)   # generate your data sample with N elements
p, x = np.histogram(s, bins=n) # bin it into n = N/10 bins
x = x[:-1] + (x[1] - x[0])/2   # convert bin edges to centers
f = UnivariateSpline(x, p, s=n)
plt.plot(x, f(x))
plt.show()

You can increase or decrease s (smoothing factor) within the UnivariateSpline function call to increase or decrease smoothing. For example, using the two you get:

enter image description here

Probability density Function (PDF) of inter-arrival time of events.

import numpy as np
import scipy.stats

# generate data samples
data = scipy.stats.expon.rvs(loc=0, scale=1, size=1000, random_state=123)

A kernel density estimation can then be obtained by simply calling

scipy.stats.gaussian_kde(data,bw_method=bw)

where bw is an (optional) parameter for the estimation procedure. For this data set, and considering three values for bw the fit is as shown below

# test values for the bw_method option ('None' is the default value)
bw_values =  [None, 0.1, 0.01]

# generate a list of kde estimators for each bw
kde = [scipy.stats.gaussian_kde(data,bw_method=bw) for bw in bw_values]


# plot (normalized) histogram of the data
import matplotlib.pyplot as plt 
plt.hist(data, 50, normed=1, facecolor='green', alpha=0.5);

# plot density estimates
t_range = np.linspace(-2,8,200)
for i, bw in enumerate(bw_values):
    plt.plot(t_range,kde[i](t_range),lw=2, label='bw = '+str(bw))
plt.xlim(-1,6)
plt.legend(loc='best')

enter image description here

Reference:

Python: Matplotlib - probability plot for several data set

how to plot Probability density Function (PDF) of inter-arrival time of events?

Sign up to request clarification or add additional context in comments.

2 Comments

Normal probability plot are plotted with Z-scores on Y-axis but here binned vaues of cumulative frequency is used (referring to the single plot).. why is it so ??

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.