Multiple linear regression with numpy

Question

I want to calculate multiple linear regression with numpy. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    # one formatted row per record; the open parenthesis continues the line
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(output for above:)

    y,  x1,  x2,    x3, x4,  x5, x6,   x7
20.64, 0.0, 296,  54.7,  0, 519,  2, 24.0
25.12, 0.0, 387,  54.7,  1, 678,  2, 24.0
19.22, 0.0, 535,  54.7,  0, 296,  2, 24.0
18.99, 0.0, 519, 18.97,  0, 296,  2, 54.9
18.89, 0.0, 296, 18.97,  0, 535,  2, 54.9
25.51, 0.0, 678, 18.97,  1, 387,  2, 54.9
20.19, 0.0, 296, 25.51,  0, 519,  2, 54.9
20.75, 0.0, 535, 25.51,  0, 296,  2, 54.9
24.13, 0.0, 387, 25.51,  1, 678,  2, 54.9
19.24, 0.0, 519,     0,  0, 296,  2, 55.0
20.90, 0.0, 296,     0,  0, 535,  2, 55.0
25.30, 0.0, 678,     0,  1, 387,  2, 55.0
20.78, 0.0, 296,     0,  0, 519,  2, 55.2
23.01, 0.0, 535,     0,  0, 296,  2, 55.2
25.20, 0.0, 387,     0,  1, 678,  2, 55.2
19.12, 0.0, 519,     0,  0, 296,  2, 55.3
20.03, 0.0, 296,     0,  0, 535,  2, 55.3
25.22, 0.0, 678,     0,  1, 387,  2, 55.3

I have written this function, which I think returns the coefficients a1 to a7 and c from Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c.

def calculate_linear_regression_numpy(xx, yy):
    """ calculate multiple linear regression """
    import numpy as np
    from numpy import linalg

    A = np.column_stack((xx, np.ones(len(xx))))
    coeffs = linalg.lstsq(A, yy)[0]  # obtaining the parameters

    return coeffs

xx is a list that contains each row of x's, and yy is a list that contains all y.

The A is this:

00 = {ndarray} [   0.   296.   519.    2.    0.   24.    54.7    1. ]
01 = {ndarray} [   0.   296.   535.    2.    0.   24.    54.7    1. ]
02 = {ndarray} [   0.   387.   678.    2.    1.   24.    54.7    1. ]
03 = {ndarray} [   0.   296.   519.    2.    0.   54.9   18.97957206    1. ]
04 = {ndarray} [   0.   296.   535.    2.    0.   54.9   18.97957206    1. ]
05 = {ndarray} [   0.   387.   678.    2.    1.   54.9   18.97957206    1. ]
06 = {ndarray} [   0.   296.   519.    2.    0.   54.9   25.518085    1.   ]
07 = {ndarray} [   0.   296.   535.    2.    0.   54.9   25.518085    1.   ]
08 = {ndarray} [   0.   387.   678.    2.    1.   54.9   25.518085    1.   ]
09 = {ndarray} [   0.   296.   519.    2.    0.   55.    0.    1.]
10 = {ndarray} [   0.   296.   535.    2.    0.   55.    0.    1.]
11 = {ndarray} [   0.   387.   678.    2.    1.   55.    0.    1.]
12 = {ndarray} [   0.   296.   519.    2.    0.   55.2   0.    1. ]
13 = {ndarray} [   0.   296.   535.    2.    0.   55.2   0.    1. ]
14 = {ndarray} [   0.   387.   678.    2.    1.   55.2   0.    1. ]
15 = {ndarray} [   0.   296.   519.    2.    0.   55.3   0.    1. ]
16 = {ndarray} [   0.   296.   535.    2.    0.   55.3   0.    1. ]
17 = {ndarray} [   0.   387.   678.    2.    1.   55.3   0.    1. ]

And the np.dot(A,coeffs) is this:

[ 19.69873196  20.33871176  24.95249051  19.59198545
20.23196525  24.845744    19.41602911  20.05600891  24.66978766
20.09928377  20.73926357  25.35304232  20.09237109  20.73235089
25.34612964  20.08891474  20.72889454  25.34267329]

When the function returns, coeffs contains these 8 values:

[0.0, -0.0010535377771944548, 0.039998737474281849, 0.62111016637058492, -1.0101687709958682, -0.034563440146209781, -0.026910757873959575, 0.31055508318529385]

I don't know if the coeffs[0] or the coeffs[7] is the c from the equation Y defined above.

I take these coeffs and calculate the new Ŷ by multiplying them with the new ẍ's, like this:

Ŷ = a1ẍ1 + a2ẍ2 + a3ẍ3 + a4ẍ4 + a5ẍ5 + a6ẍ6 + a7ẍ7 + c

Am I calculating Ŷ correctly? And what should I do when I get a Ŷ with a negative number? Which term is the c (a[0] or a[7])?

The c term would be a[7], since you are putting the ones column at the end, but your coefficients don't make sense. You can check by doing print np.dot(A,coeffs); it should give you yy, or something very similar. When I tried, I got the coefficients [ -0.49104607 0.83271938 0.0860167 0.1326091 6.85681762 22.98163883 -41.08437805 -19.08085066]
– Noel Segura Meraz, Commented Jan 14, 2016 at 10:23
Look at the x2 and x3 values of row 00: they are 1.10224946e+09 and 4.40557880e+07, which don't appear anywhere in the first group of data you presented. Also, there are 18 rows in the first data set and 19 in A.
– Noel Segura Meraz, Commented Jan 14, 2016 at 13:17
If you input the x values in the right order, then that is simply the value your regression calculates. What a negative Y means depends on what you are calculating, but at the equation level a negative answer is completely valid.
– Noel Segura Meraz, Commented Jan 14, 2016 at 13:31

Chris · Accepted Answer · 2016-01-14 10:10:16Z

The columns keep the order you specify them in, otherwise you would be unable to use the coefficients!

Remember, from the matrix form of the least squares problem, your estimate of Y is given by A dot C where C is your coefficient vector/matrix.

So, print out A, and it should be in the form X1 ... X7 [column of ones].

Whichever column contains your ones, the entry at the same index in the coefficient vector is your offset coefficient.
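As a minimal sketch of the point above, the snippet below fits synthetic data whose true coefficients are known, so the position of the offset can be read off directly. The data, the variable names and the rcond=None argument (accepted by recent NumPy versions) are assumptions for illustration, not part of the question's code.

```python
import numpy as np

# Hypothetical noise-free data: y = 2*x1 - 3*x2 + 5
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

# Ones column appended last, as in the question's function
A = np.column_stack((X, np.ones(len(X))))
coeffs = np.linalg.lstsq(A, y, rcond=None)[0]

# Coefficients come back in column order, so the offset is the last entry
slopes, intercept = coeffs[:-1], coeffs[-1]

# Predicting for new rows: build A_new with the same column layout, then A_new . coeffs
X_new = np.array([[1.0, 1.0], [0.0, 0.0]])
A_new = np.column_stack((X_new, np.ones(len(X_new))))
y_new = A_new.dot(coeffs)
```

If the ones column were placed first instead, the offset would be coeffs[0]; the only requirement is that new rows use the same column order as the fit.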

Judging just by the size of the parameters, coeff[7] looks to be the offset: it is orders of magnitude larger than the others, which would not make sense for a multiplicative coefficient given the X and Y values you supplied.
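One thing that may be worth checking: in the A shown in the question, the column at index 3 is a constant 2.0, so the ones column is not the only constant column. In that situation the least-squares problem is rank deficient, and np.linalg.lstsq returns the minimum-norm solution, which splits the constant term between the two columns. The coeffs in the question are consistent with this, since coeffs[3] is twice coeffs[7]. The sketch below reproduces the effect on small hypothetical data; the names and the rcond=None argument are assumptions.

```python
import numpy as np

# Hypothetical data: one varying predictor, one predictor stuck at 2.0
x_var = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_const = np.full(5, 2.0)
y = 3.0 * x_var + 10.0

A = np.column_stack((x_var, x_const, np.ones(5)))
coeffs, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# rank is smaller than the number of columns: A is rank deficient
print(rank, A.shape[1])

# The constant part of the fit is shared by the two constant columns,
# so the effective offset is 2.0 * coeffs[1] + coeffs[2], not coeffs[2] alone
effective_offset = 2.0 * coeffs[1] + coeffs[2]
print(coeffs, effective_offset)
```

The fitted values A.dot(coeffs) are still correct in this case; only the interpretation of the last coefficient as "the" offset changes. Dropping constant columns before fitting avoids the ambiguity.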


2 Comments

xeon123 Over a year ago

Does it make sense to add the difference between the previously predicted value Ŷ and the real value Y to the new Ŷ, in order to reduce the error in the new prediction?

Chris Over a year ago

Can you add what your A matrix looks like? Also, adding the residual does not really make sense. By definition, the model already minimizes the overall squared error on the data in the first step. What you should do is plot your residuals. If they look random, you will not do better with this model. If they seem to have some structure, you need to look at a different model form (e.g. non-linear regression).
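A rough sketch of that residual check, on hypothetical data (the data, names and rcond=None are assumptions; the plotting lines are commented out so the snippet runs without matplotlib):

```python
import numpy as np

# Hypothetical noisy linear data
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 * x + 4.0 + rng.normal(scale=0.5, size=x.size)

A = np.column_stack((x, np.ones(x.size)))
coeffs = np.linalg.lstsq(A, y, rcond=None)[0]

fitted = A.dot(coeffs)
residuals = y - fitted

# With a ones column in A, least-squares residuals sum to (numerically) zero
# and are orthogonal to every column of A, so there is no systematic
# correction left over to add to a new prediction.
print(residuals.sum())
print(A.T.dot(residuals))

# To look for structure, plot residuals against the fitted values, e.g.:
# import matplotlib.pyplot as plt
# plt.scatter(fitted, residuals); plt.axhline(0.0); plt.show()
```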
