6

I'm trying to conduct the residual analysis for simple linear regression. I need to prove that the residuals follow an approximate Normal Distribution.

The csv file I'm using has values for Percentage of marks in Grade 10 and the Salary the student makes.
Once I run the below code, my plot looks like this:

Plot of residuals from code

The plot in the book looks like this:

Plot of residuals from the book

I was expecting my plot to show up like the book as the data is the same. I have double-checked to make sure I'm not missing any data etc. I have split the data set into training and test as per the book as well.

Data is as follows:

Percentage Salary
62  270000
76.33   200000
72  240000
60  250000
61  180000
55  300000
70  260000
68  235000
82.8    425000
59  240000
58  250000
60  180000
66  428000
83  450000
68  300000
37.33   240000
79  252000
68.4    280000
70  231000
59  224000
63  120000
50  260000
69  300000
52  120000
49  120000
64.6    250000
50  180000
74  218000
58  360000
67  150000
75  250000
60  200000
55  300000
78  330000
50.08   265000
56  340000
68  177600
52  236000
54  265000
52  200000
76  393000
64.8    360000
74.4    300000
74.5    250000
73.5    360000
57.58   180000
68  180000
69  270000
66  240000
60.8    300000

The code is below:

# Importing all required libraries for building the regression model
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the dataset into dataframe
mba_salary_df = pd.read_csv( 'MBA Salary.csv' )

# Add constant term of 1 to the dataset
X = sm.add_constant( mba_salary_df['Percentage in Grade 10'] )
Y = mba_salary_df['Salary']

# Split dataset into train and test set into 80:20 respectively
train_X, test_X, train_y, test_y = train_test_split( X, Y, train_size = 0.8,random_state = 100 )

# Fit the regression model
mba_salary_lm = sm.OLS( train_y, train_X ).fit()

mba_salary_resid = mba_salary_lm.resid 

probplot = sm.ProbPlot(mba_salary_resid) 

plt.figure( figsize = (8, 6) ) 

probplot.ppplot(line='45') 

plt.title("Normal P-P Plot of Regression Standardized Residuals") 

plt.show()
8
  • Maybe first use print() (and print(type(...)), print(len(...)), etc.) to see which part of code is executed and what you really have in variables. It is called "print debugging" and it helps to see what code is really doing. Commented Apr 13 at 0:51
  • What book do you mean? Commented Apr 13 at 0:51
  • Please provide a minimal reproducible example. We can't really help you if we have to spend too much time guessing what your actual code is (your actual code certainly doesn't have 2 imports on the same line, nor does it use fancy quotes instead of regular ' ones, and it probably has a Y variable defined somewhere, and an X variable with a column name that matches one of the column names of the data). Commented Apr 13 at 3:23
  • @furas, I'm learning from a book and the snip of the plot is from the book. Commented Apr 13 at 5:10
  • @chrslg code now fixed. Thanks Commented Apr 13 at 5:11

2 Answers 2

4

So, if I understand correctly, you are trying to get the residual part of a linear regression (so the error) on your training dataset, and check if the distribution of that residual part follows a normal law.

But ppplot or qqplot need to know which law you want to compare your dataset against.

As you probably understand, what it does is, for each sample value s, plot a point whose x coordinate is the theoretical CDF of a distribution, and whose y coordinate is the experimental CDF (that is, the proportion of values in your dataset that are lower than this s).

If you don't specify a distribution, a standard normal law (μ=0, σ=1) is used by default. But your residuals have a scale far larger than a standard deviation of 1. So, in practice, all your residuals are either very negative or very positive from the standpoint of an N(0,1) law. By that I mean that either s is so negative that ℙ(X<s) is practically 0, or s is so positive that ℙ(X<s) is practically 1. (For X~N(0,1), that happens for any s lower than -3 or greater than +3, roughly. As you know, ℙ(|X|<2.58)≈99%, and 2.58 or 3 is very small compared to your values.)
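To make the saturation concrete, here is a minimal sketch using only the standard library. The residual values are made up for illustration (they are not the asker's actual residuals), chosen only to be on a salary-like scale:

```python
from statistics import NormalDist

# Hypothetical residuals on a salary-like scale (illustrative values only).
residuals = [-120000.0, -45000.0, -8000.0, 15000.0, 60000.0, 150000.0]

# The default reference law: a standard normal, N(0, 1).
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# P(X < s) under N(0, 1) for each residual s.
probabilities = [standard_normal.cdf(s) for s in residuals]

for s, p in zip(residuals, probabilities):
    print(f"s = {s:>10.0f}   P(X < s) = {p:.6f}")
```

Every probability comes out as practically 0 or practically 1, so the points of a P-P plot pile up at the two ends instead of spreading along the diagonal.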

So, in short, you need to say against which law you want to test your residual. If you don't, the default is a N(0,1) law that is obviously not similar at all to the distribution of your residuals (in other words: it works! and the pp-plot being bad indicates exactly what it is supposed to indicate: that, no, your residual does not follow at all a N(0,1) law).

If you don't know the parameters of that law (you already said you want to test against a normal law), you probably want to fit them. You can center and scale your data beforehand (so that they are indeed supposed to follow approximately an N(0,1) law), or compute the mean and standard deviation of your residuals and pass them as the loc and scale arguments to ProbPlot, or create a normal law yourself (scipy.stats.norm(mean, stdev)) and pass that law to ProbPlot.
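As a rough illustration of why standardizing the data and shifting the reference law amount to the same thing, here is a small standard-library sketch. The residuals are hypothetical, and using the population standard deviation (pstdev) is an assumption made for the example:

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical residuals on a salary-like scale (illustrative values only).
residuals = [-120000.0, -45000.0, -8000.0, 15000.0, 60000.0, 150000.0]

mu = mean(residuals)
sigma = pstdev(residuals)  # population standard deviation (an assumption here)

# Option 1: center and scale the data, then compare against N(0, 1).
z_scores = [(s - mu) / sigma for s in residuals]
p_standardized = [NormalDist(0.0, 1.0).cdf(z) for z in z_scores]

# Option 2: keep the raw data and compare against N(mu, sigma).
fitted_normal = NormalDist(mu, sigma)
p_fitted = [fitted_normal.cdf(s) for s in residuals]

for s, p1, p2 in zip(residuals, p_standardized, p_fitted):
    print(f"s = {s:>10.0f}   standardized: {p1:.4f}   fitted law: {p2:.4f}")
```

Both options give the same probabilities, and they now spread over the (0, 1) interval instead of collapsing to 0 or 1.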

Or, even simpler, ask ProbPlot to fit the parameters for you (by default it fits a normal law, since the dist argument defaults to a normal distribution; I understand that is what you want anyway).

So, long story short, if I understand correctly what you want to do

probplot = sm.ProbPlot(mba_salary_resid, fit=True) 

is probably what you want.


1 Comment

Hi Chris! Yes, that is what I was looking for! I understand now that the data I was plotting was not being compared against a fitted distribution. By using the fit=True option, Python converted the training residuals to their z values, which have mean = 0 and standard deviation = 1. Thanks again!
2

It's hard to reproduce exactly what you have due to many things to fix in your code. However, just looking at your plot and what "the book" offers, I think your problem is the use of ppplot instead of qqplot . So change

probplot.ppplot(line='45')

to

probplot.qqplot(line='45')

1 Comment

Hi Toonia, Thank you for answering. Unfortunately changing ppplot to qqplot has plotted the residuals vertically.
