
In a machine learning regression problem, why is the local minimum computed using the derivative of the function instead of the function itself?

Example: http://en.wikipedia.org/wiki/Gradient_descent

The gradient descent algorithm is applied to find a local minimum of the function

$$f(x) = x^4 - 3x^3 + 2, \tag{A}$$

with derivative

$$f'(x) = 4x^3 - 9x^2. \tag{B}$$

Here, to find the local minimum of function (A) using the gradient descent algorithm, they use the derivative of (A), which is function (B).
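For concreteness, here is a minimal Python sketch of that procedure on this example; the starting point, step size `gamma` and stopping tolerance are illustrative choices, not values prescribed by the question:

```python
# Minimal gradient descent sketch for f(x) = x^4 - 3x^3 + 2.
# Only the derivative f'(x) = 4x^3 - 9x^2 is evaluated inside the loop;
# f(x) itself is never needed to take a step.

def f_prime(x):
    return 4 * x**3 - 9 * x**2

x = 6.0          # arbitrary starting point
gamma = 0.01     # step size (learning rate), chosen by hand
tolerance = 1e-8

while True:
    step = gamma * f_prime(x)
    x -= step
    if abs(step) < tolerance:
        break

print(x)  # converges to roughly 2.25, i.e. x = 9/4
```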

2 Answers


The reason is that, because the function is convex (or concave if you're doing maximisation; these problems are equivalent), you know that there is a single minimum (maximum). This means that there is a single point where the gradient is equal to zero. There are techniques that use the function values themselves, but if you can compute the gradient, you can converge much faster, because the gradient gives you information about how far you are from the optimum.

As well as Gradient Descent, there's an optimisation method known as Newton's method, which requires computation of the second derivative (the Hessian in multivariate optimisation). This converges faster still, but requires you to be able to invert the Hessian, which is not feasible if you have a lot of parameters. So there are methods to get around this which compute a limited-memory approximation of the Hessian. These methods converge faster still because they use information about the curvature of the function: it's a simple tradeoff, where the more you know about the function you're trying to optimise, the faster you can find the solution.
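To make the comparison concrete, here is a rough sketch of Newton's method on the same one-dimensional example; in one dimension the Hessian is just the second derivative f''(x) = 12x^2 - 18x, so "inverting" it is an ordinary division. The starting point and tolerance are again arbitrary illustrative choices:

```python
# Newton's method sketch for the same f(x) = x^4 - 3x^3 + 2.
# Each step uses the first and second derivative of f; the extra
# curvature information typically cuts the iteration count drastically.

def f_prime(x):
    return 4 * x**3 - 9 * x**2

def f_double_prime(x):
    return 12 * x**2 - 18 * x

x = 6.0  # arbitrary starting point
for _ in range(50):
    step = f_prime(x) / f_double_prime(x)  # the 1-D "Hessian inverse" is just 1/f''(x)
    x -= step
    if abs(step) < 1e-10:
        break

print(x)  # roughly 2.25, reached in a handful of iterations
```

Starting from the same point, this reaches the minimum in far fewer iterations than the plain gradient step above, at the cost of one extra derivative evaluation per step.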


3 Comments

Sorry, I have a very silly question. With f'(x) = 4x^3 − 9x^2, just by looking at the function we can find the local minimum, i.e. f'(x) = 0 occurs at x = 0, so why do we need gradient descent?
Well, it's not quite just by looking at it: you can solve for x when f'(x) = 0 (worked through just after these comments). This is a simple example: most of the time when you use optimisation methods, you can't get an analytic solution. Take a look at en.wikipedia.org/wiki/Convex_optimization for more details.
@BenAllison So is it because of computational cost? I mean, if we have a loss function f(x), we want to find the x that minimises it. But that means we would have to check all possible values of x. Instead we compute the gradient at some random initial x and move towards the minimum (according to the gradient), saving computational time?
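For this particular toy function the analytic route discussed in the comments is easy to carry out, as a quick check of the working:

$$f'(x) = 4x^3 - 9x^2 = x^2(4x - 9) = 0 \quad\Rightarrow\quad x = 0 \ \text{or} \ x = \tfrac{9}{4},$$

and since f''(x) = 12x^2 - 18x is positive at x = 9/4 but zero at x = 0, the local minimum is at x = 9/4 = 2.25. The answer's point is that for typical machine learning losses, with many parameters, no such closed-form solution is available, so an iterative method like gradient descent is used instead.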

I'm not a mathematician, so I can't give you an exact answer; however, you need to understand what differentiation does, e.g.:

http://en.wikipedia.org/wiki/Derivative http://en.wikipedia.org/wiki/Differential_of_a_function

this is what you need (what differentiation does): http://en.wikipedia.org/wiki/File:Graph_of_sliding_derivative_line.gif

The derivative at a point equals the slope of the tangent line to the graph of the function at that point, and this is exactly what you want when you are looking for a descent direction. Take it as a very informal point of view; the Wikipedia articles will give you much deeper and more precise knowledge.
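As an informal numerical check of that tangent-line picture, here is a small sketch on the question's function; the point x = 3 and the step h are arbitrary choices:

```python
# The derivative as the slope of the tangent line: a numerical check.
# The slope of a secant line over a tiny interval h approaches
# the analytic derivative f'(x) = 4x^3 - 9x^2.

def f(x):
    return x**4 - 3 * x**3 + 2

def f_prime(x):
    return 4 * x**3 - 9 * x**2

x, h = 3.0, 1e-6
secant_slope = (f(x + h) - f(x)) / h

print(secant_slope)  # approximately 27.000027, close to the exact value
print(f_prime(x))    # 27.0
```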

