What is the difference between Gradient Descent and Newton's Gradient Descent?

Question

I understand what Gradient Descent does. Basically it tries to move towards the local optimal solution by slowly moving down the curve. I am trying to understand what is the actual difference between the plain gradient descent and the Newton's method?

From Wikipedia, I read this short line "Newton's method uses curvature information to take a more direct route." What does this intuitively mean?

curvature relates to how Newton's method uses the fuction's second order derivative. Gradient descent is typically first order. — akk
– akk, Commented Nov 30, 2012 at 13:01
Very similar, also with a good answer: math.stackexchange.com/q/1085436/407385 — rehctawrats
– rehctawrats, Commented Dec 20, 2018 at 8:21

Florian Brucker · Accepted Answer · 2013-05-24 05:47:03Z

81

At a local minimum (or maximum) x, the derivative of the target function f vanishes: f'(x) = 0 (assuming sufficient smoothness of f).

Gradient descent tries to find such a minimum x by using information from the first derivative of f: It simply follows the steepest descent from the current point. This is like rolling a ball down the graph of f until it comes to rest (while neglecting inertia).

Newton's method tries to find a point x satisfying f'(x) = 0 by approximating f' with a linear function g and then solving for the root of that function explicitely (this is called Newton's root-finding method). The root of g is not necessarily the root of f', but it is under many circumstances a good guess (the Wikipedia article on Newton's method for root finding has more information on convergence criteria). While approximating f', Newton's method makes use of f'' (the curvature of f). This means it has higher requirements on the smoothness of f, but it also means that (by using more information) it often converges faster.

edited May 24, 2013 at 5:47

answered Aug 22, 2012 at 5:37

Florian Brucker

10.6k6 gold badges56 silver badges92 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Terence Chow Over a year ago

I always see mentions of choosing the 'steepest descent'. What does that mean? Is that the most negative number of f'(x)?

Florian Brucker Over a year ago

@Chowza: If your domain is multi-dimensional, e.g. if f maps 2D points to real numbers, then the gradient of f at any point is not a scalar number but a vector. The reason is that the "steepness" of f at that point depends on the direction that you're looking in. It's like standing on a mountain top: If you look north the mountain may drop off very sharply, but to the other sides it may be less steep. Choosing the steepest descent hence means choosing that direction which causes the greatest change in your target function.

dashnick · Accepted Answer · 2016-02-06 20:20:51Z

16

Put simply, gradient descent you just take a small step towards where you think the zero is and then recalculate; Newton's method, you go all the way there.

edited Feb 6, 2016 at 20:20

answered Feb 5, 2016 at 22:41

dashnick

2,0781 gold badge22 silver badges37 bronze badges

6 Comments

bers Over a year ago

Is "all the way" true for a non-quadratic function?

dashnick Over a year ago

Yes, for non quadratic functions you are just approximating the first derivative with a line. This is a bit hand wavey but I think it's fine for intuition.

bers Over a year ago

Ok, I agree. All the way to "where you think the zero is" is undoubtedly correct.

Mr Purple Over a year ago

If the main difference as you say is "small steps" vs "all the way", could you elaborate on how the size of the "small step" is determined?

dashnick Over a year ago

@MrPurple it's not very well defined, small enough that the gradient doesn't change too much (so you don't keep zigzagging) but large enough that you make progress. A lot of research is around how to optimize this adaptively. For intuition, think like on the order of .1% of the x value.

|

Matsaulait · Accepted Answer · 2021-11-09 17:29:41Z

7

Building on the answer by @Cheng, it's helpful to realise that because Newton's Method finds the root of a function, we will apply Newton's method to f'() in order to find an optimum of f(). Therefore, the update rule for Newton's Method in this case is:

new_guess = old_guess - f'(old_guess)/f''(old_guess), where f''() is the curvature of the function to be optimised.

In comparison, the update rule in gradient descent is:

new_guess = old_guess - f'(old_guess)*alpha, where alpha denotes the step size.

From this you can roughly see how Newton's method uses the function's curvature f''() to increase or decrease the size of its update.

answered Nov 9, 2021 at 17:29

Matsaulait

1612 silver badges2 bronze badges

Comments

MimiEAM · Accepted Answer · 2017-05-12 17:15:27Z

5

Edit 2017: The original link is dead - but the way back machine still got it :) https://web.archive.org/web/20151122203025/http://www.cs.colostate.edu/~anderson/cs545/Lectures/week6day2/week6day2.pdf

this power point the main ideas are explained simply http://www.cs.colostate.edu/~anderson/cs545/Lectures/week6day2/week6day2.pdf

I hope this help :)

edited May 12, 2017 at 17:15

answered Aug 22, 2012 at 5:36

MimiEAM

2,6401 gold badge27 silver badges29 bronze badges

1 Comment

CpCd0y Over a year ago

The link is down

Cheng · Accepted Answer · 2020-04-23 10:22:12Z

If you simply compare Gradient Descent and Newton's method, the purpose of the two methods are different.

Gradient Descent is used to find(approximate) local maxima or minima (x to make min f(x) or max f(x)). While Newton's method is to find(approximate) the root of a function, i.e. x to make f(x) = 0

In this sense, they are used to solve different problems. However, Newton's method can also be used in the context of optimization (the realm that GD is solving). Because finding maxima or minima can be approached by finding f'(x) = 0 which is exactly Newton's method is used for.

In conclusion, two methods can be used in optimization: 1)GD and 2)find x so f'(x)=0 and Newton's method is just a way to solve that second problem.

Collectives™ on Stack Overflow

What is the difference between Gradient Descent and Newton's Gradient Descent?

5 Answers 5

2 Comments

6 Comments

Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

2 Comments

6 Comments

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related