The purpose is purely educational. Students who jump straight to mid- or high-level libraries like TensorFlow, Keras, or Theano don't have to compute the gradients themselves. On the one hand, this saves a lot of time; on the other hand, it makes it very easy to abstract away the learning process.
Here's how Andrej Karpathy (who lectured earlier offerings of the CS231n class at Stanford) puts it:
When we offered CS231n (Deep Learning class) at Stanford, we intentionally designed the programming assignments to include explicit calculations involved in backpropagation on the lowest level. The students had to implement the forward and the backward pass of each layer in raw numpy. Inevitably, some students complained on the class message boards:
“Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?”
...
The problem with Backpropagation is that it is a leaky abstraction.
In other words, it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.
I recommend reading the whole post; it's very interesting.
So you try to compute gradients manually. When you do, you find it's pretty hard to tell whether the code is right: it's just a raw formula that takes a bunch of floating-point numbers and returns another bunch of floating-point numbers. This is where an alternative, numerical method to compare against becomes very useful.
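For instance, a centered finite-difference estimate can be written in a few lines of numpy. This is just a sketch; the function name `numerical_gradient` and the step size `h` are my own choices for illustration, not part of any particular framework.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Estimate the gradient of f at x with centered finite differences.

    f: a function taking a numpy array and returning a scalar
    x: a float numpy array, the point at which to estimate the gradient
    h: step size; too large loses accuracy, too small loses numerical precision
    """
    grad = np.zeros_like(x, dtype=float)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h               # evaluate f(x + h) along this coordinate
        f_plus = f(x)
        x[idx] = old - h               # evaluate f(x - h) along this coordinate
        f_minus = f(x)
        x[idx] = old                   # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad
```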
Of course, analytical formulas are faster and more precise, and they are used in practice whenever possible. But while studying neural networks and backpropagation, it's very useful to work through the manual computation at least once. Besides, it sometimes helps to find bugs.
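As a quick illustration of such a check (the variable names and the error threshold here are my own, just for the sketch): for f(x) = Σ xᵢ², the analytical gradient is 2x, and the numerical estimate from the `numerical_gradient` sketch above should agree with it to many decimal places.

```python
# Compare the analytical gradient of f(x) = sum(x**2) with the numerical estimate.
x = np.random.randn(3, 4)
f = lambda x: np.sum(x ** 2)

analytical = 2 * x                      # d/dx sum(x^2) = 2x
numerical = numerical_gradient(f, x)

# Relative error is the usual comparison; values around 1e-7 or smaller are a good sign.
rel_error = np.max(np.abs(analytical - numerical) /
                   np.maximum(np.abs(analytical) + np.abs(numerical), 1e-8))
print(rel_error)
```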