I'm a new student in AI, currently learning linear regression. I'm using the California housing dataset for my experiments. My goal is to predict the 'population' column from the 'total_rooms' column. I used the following formula and code to compute the slope 'm' and intercept 'c'.
Formula: $$ m = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2 }$$ $$ c = \bar{y} - m \bar{x} $$
The code is as follows and it works perfectly:
# Linear regression using the above formula
import numpy as np
import matplotlib.pyplot as plt

# 'train' is the training split of the California housing data (a pandas DataFrame)
x_vals = np.array(train['total_rooms'])
y_vals = np.array(train['population'])
xm = np.mean(x_vals)
ym = np.mean(y_vals)

def compute_m(x_vals, y_vals):
    n = len(x_vals)
    sum_xy, sum_xx = 0, 0
    for i in range(n):
        sum_xy += (x_vals[i] - xm) * (y_vals[i] - ym)
        sum_xx += (x_vals[i] - xm)**2
    return sum_xy / sum_xx

m = compute_m(x_vals, y_vals)
c = ym - m*xm

xl = np.array([np.min(x_vals), np.max(x_vals)])
yl = m*xl + c
plt.scatter(x_vals, y_vals)
plt.plot(xl, yl, 'r')
plt.show()
print('m, c:', m, c)
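For reference, the same formula can also be written in a fully vectorized way. This is just a minimal equivalent sketch of the loop above (not part of my original script), assuming x_vals and y_vals are defined as in the previous block:

# Vectorized version of the closed-form formula (equivalent to compute_m above)
import numpy as np

def closed_form_fit(x, y):
    xm, ym = x.mean(), y.mean()
    m = np.sum((x - xm) * (y - ym)) / np.sum((x - xm) ** 2)
    c = ym - m * xm
    return m, c

# m2, c2 = closed_form_fit(x_vals, y_vals)  # should match m and c from the loop version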
To verify that my code is working, I also checked it against the built-in linear regression in scikit-learn, and it returns the exact same answer:
# Now use the scikit-learn library
from sklearn.linear_model import LinearRegression

model = LinearRegression()
ans = model.fit(x_vals.reshape(-1, 1), y_vals.reshape(-1, 1))
score = ans.score(x_vals.reshape(-1, 1), y_vals.reshape(-1, 1))
intercept, coef = ans.intercept_, ans.coef_
print('results from scikit-learn:', score, intercept, coef)
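To make the agreement explicit, here is a small sketch of how the two results could be compared, reusing m, c, coef, and intercept from the blocks above:

# Sanity check: the hand-rolled m, c should match scikit-learn's coef_/intercept_
import numpy as np

print('slope matches:    ', np.isclose(m, coef.ravel()[0]))
print('intercept matches:', np.isclose(c, intercept.ravel()[0]))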
But the problem arises when I try to use gradient descent for learning the slope m and intercept c. My code is as follows:
# Gradient descent
def loss_func(y, ypred):
    mse = (y - ypred)**2
    return mse

def gradient_loss(y, x, mc, bc):
    n = len(y)
    print(n)
    m_loss, b_loss = 0, 0
    for i in range(n):
        ml = (-2/n) * x[i] * (y[i] - mc*x[i] - bc)
        m_loss += ml
        bl = (-2/n) * (y[i] - mc*x[i] - bc)
        b_loss += bl
    return m_loss, b_loss

ep = 100000
alpha = 0.000000001  # learning rate
m, b = 0, 0
for e in range(ep):
    y_pred = m*x_vals + b
    m_loss, b_loss = gradient_loss(y_vals, x_vals, m, b)
    print("m, b, m_loss, b_loss:", m, b, m_loss, b_loss)
    m = m - m_loss*alpha
    b = b - b_loss*0.001  # note: a different, much larger step size for b than for m
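Written out, the gradients that gradient_loss computes (with $L$ denoting the mean squared error over the training set) are:
$$ L = \frac{1}{n}\sum_{i=1}^n (y_i - m x_i - b)^2 $$
$$ \frac{\partial L}{\partial m} = -\frac{2}{n}\sum_{i=1}^n x_i\,(y_i - m x_i - b), \qquad \frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^n (y_i - m x_i - b) $$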
I took the derivative of the squared loss and use the learning rate $\alpha$ to update the slope $m$ and the intercept $b$ (sorry, it is the same as $c$ in the previous block) for 100000 iterations. But note that I have to use a different learning rate for $m$ and $b$. No matter what learning rate I try, a single value does not work for both $m$ and $b$. If I write 'b = b - b_loss*alpha' in the last line, it never converges for any value of $\alpha$.
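To show what I mean, here is a minimal diagnostic sketch (not part of my script above) that reuses gradient_loss from the block above to compare the sizes of the two gradients at the starting point m = b = 0:

# Compare the magnitudes of the two gradients at the initial point m = b = 0
g_m, g_b = gradient_loss(y_vals, x_vals, 0, 0)
print('gradient w.r.t. m:', g_m)
print('gradient w.r.t. b:', g_b)
print('ratio |g_m / g_b|:', abs(g_m / g_b))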
I've never seen any article or book use different learning rates for the slope m and the intercept b. Can anybody please explain what's happening here, and why I had to use a different learning rate for each? :(
