Questions tagged [gradient-descent]
For questions about gradient descent, a method for finding the parameter values of a parameterized function that minimize another function, often called the loss or error function. It iteratively descends the loss surface toward a minimum by subtracting from each parameter the product of a learning rate and the corresponding partial derivative of the loss (the components of the gradient).
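As a minimal sketch of the update rule described above: the function names, the quadratic loss, the learning rate, and the step count below are all illustrative assumptions, not part of any question on this page.

```python
def gradient_descent(grad, params, learning_rate=0.1, steps=100):
    """Repeatedly move each parameter against its partial derivative."""
    params = list(params)
    for _ in range(steps):
        g = grad(params)  # partial derivatives of the loss at the current point
        # Update rule: p <- p - learning_rate * dL/dp, applied to every parameter
        params = [p - learning_rate * gi for p, gi in zip(params, g)]
    return params


# Hypothetical loss L(w, b) = (w - 3)^2 + (b + 1)^2, whose minimum is at (3, -1).
def loss_grad(params):
    w, b = params
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]


w, b = gradient_descent(loss_grad, [0.0, 0.0])
print(w, b)  # both values end up close to the minimizer (3, -1)
```

On this convex toy loss the iterates converge because the learning rate is small enough; with a larger rate (above 1.0 for this particular loss) the same loop would diverge.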
223 questions
0
votes
1
answer
47
views
What are the pros and cons of this algorithm for training of an MLP?
I got the following problem in a Computational Intelligence course exam.
Analyze the following formulas for training of an MLP as an alternative training algorithm for MLPs. Tell the pros and cons of ...
1
vote
0
answers
75
views
Does using per-parameter adaptive learning rates (e.g. in Adam) change the direction of the gradient and break steepest descent?
Note up front:
Please don’t confuse my current question with the well-known issue of noisy or varying gradient directions in stochastic gradient descent due to batch sampling. I’m aware of that and ...
0
votes
0
answers
28
views
Proximal Policy Optimization: how does the gradient ascent work on theta?
I have read through the PPO paper by Schulman et al. (literally line by line) and reviewed related posts on AI and Stack Overflow. I am missing something and not understanding how the gradient ascent ...
3
votes
1
answer
125
views
Can torch use NN optimization algorithms other than gradient descent?
(Py)torch has quite a sophisticated autograd system. Essentially, it tracks which tensor was built from which. That works well when it can be applied to the problem.
However, in the case of my ...
0
votes
0
answers
32
views
Torch gradient estimates disagreeing with analytic and perturbation approximated gradients
I'm faced with a problem where, as the title says, I'm having trouble with the torch package's built-in automatic differentiation algorithms (or my usage?). I think it was meant to be used on mini-...
1
vote
1
answer
44
views
Loss keeps increasing when using full-batch gradient descent
I am learning the linear regression model based on this tutorial. Following the example provided in the tutorial, it works fine with mini-batch stochastic gradient descent.
...
2
votes
1
answer
132
views
Learning curve behaviors across double descent regimes
I am learning about the double descent phenomenon from here: https://www.di.ens.fr/~fbach/learning_theory_class/lecture9.pdf
I was asking myself:
When training a system, how can we know in which regime ...
3
votes
1
answer
106
views
Why doesn't deep learning use modular arithmetic like cryptography, even though both deal with non-linear functions?
So, deep learning models are great at learning complex, non-linear patterns and seem to handle noise just fine. But under the hood, they rely on IEEE754 floating-point numbers, which can lose ...
1
vote
1
answer
55
views
How can gradient descent optimize a loss surface that's never fully computed?
In gradient descent for neural networks, we optimize over a loss surface defined by our loss function L(W) where W represents the network weights. However, since there are infinitely many possible ...
5
votes
1
answer
326
views
Is PyTorch's `grad_fn` for a non-differentiable function that function's inverse?
What is grad_fn for a non-differentiable function like slicing (grad_fn=<SliceBackward0>), ...
2
votes
1
answer
77
views
Does it make sense for a computational graph to have entirely non-differentiable functions?
Does it make sense for a computational graph to have entirely non-differentiable functions?
For example, PyTorch can handle non-differentiable functions and mark outputs as non-differentiable, but I'm ...
3
votes
2
answers
152
views
Do computational graphs predate the era of machine learning?
Do computational graphs predate the era of machine learning? If so, who first devised the idea of a computational graph?
0
votes
1
answer
46
views
Is the global minimum loss always the best metric?
Suppose hardware constraints are no longer a problem, so that quantum computers are everywhere.
If we define a neural network model that has many params, traditionally (using gradient descent) ...
2
votes
1
answer
65
views
Do we plug in the old values or the new values during the gradient descent update?
I have a scenario where I am trying to optimize a vector of D dimensions. Every component of the vector depends on the other components according to a function such as: summation over (i,j): (1-e(x_i)(...
1
vote
1
answer
61
views
How to normalize gradient value due to the batch size?
A = (m x n) - input
B = (n x k) - weight
output = A @ B = (m x k)
...
1
vote
1
answer
67
views
Gradient calculation in Backpropagation
Some notation for the question: $w_{ij}^l$ is the weight connecting the $i$th neuron of layer $l$ to the $j$th neuron of layer $l-1$. $z_i^l$ is the activation of the $i$th neuron in layer $l$ (for ...
1
vote
1
answer
81
views
Options for fitting a growth curve - process-based, hybrid, or neural networks?
I am trying to fit a Chapman-Richards growth curve:
$$
B = A*(1-e^{-kt})
$$
Where B is the biomass of a forest, A is the asymptote, k is the growth rate, and t is forest age. I expect the growth rate ...
0
votes
1
answer
52
views
Softmax gradient for automatic differentiation
Imagine an input vector a = {a1,a2,a3}
and z = softmax(a) = {z1,z2,z3}
So, we expect that the gradient of z with respect to a would have the same shape as vector a (so we can make a gradient step: a = a - ...
0
votes
1
answer
113
views
Why is gradient clipping not preventing my gradient descent from going out of bounds?
I'm exploring machine learning and currently studying calculus, specifically gradient descent. To practice, I am using the function: $$ f(x,y)= x^{2}y $$
I have implemented the gradient descent ...
2
votes
3
answers
2k
views
Why exactly do we need the learning rate in gradient descent?
I am currently following course 1 of Andrew Ng's Machine Learning Specialization. I understood we need a convex cost function to reach the global minimum of the loss. But that means the gradient will ...
0
votes
0
answers
177
views
How is this z-loss implementation in t5x related to this paper's loss X?
I was looking into the loss function in t5x here and see there is a z-loss added to the typical log loss definition.
The only paper I could surface on this was https://arxiv.org/abs/1604.08859, but I ...
3
votes
1
answer
54
views
Different Definitions of Momentum -- which one should I work with?
I'm seeing different ways of defining momentum, and I'm not sure whether there is a significant difference.
From my thinking, they seem to do a similar thing mathematically and in practice but I'm ...
0
votes
0
answers
152
views
Learning Rate greater than ~0.00005 significantly hinders model performance and increases loss
I have been trying to train a model with 0.001 learning rate. I tried regression techniques, early stopping and lr manipulations within epochs. But nothing felt right even though after numerous tries ...
0
votes
1
answer
67
views
Collaborative filtering using linear regression
While doing Andrew Ng's unsupervised learning specialization, I came across this algorithm for collaborative filtering:
Here, Xi refers to the feature vector of objects (e.g. action in movies, ...
0
votes
0
answers
31
views
Is there any purpose of altering neural network architecture if validation loss does not decrease but training loss does?
I am training a transformer based neural network and the validation loss is not decreasing, but the training loss does decrease. I am wondering if it's possible to debug or change the architecture ...
1
vote
1
answer
79
views
Why does the same learning rate for slope and intercept not work in linear regression?
I'm a new student in AI, currently learning linear regression. I used the California housing dataset for my experiments. My goal is to predict the 'population' column based on the 'total_rooms' ...
1
vote
1
answer
127
views
Is there a theoretical way to determine the best learning rate for gradient descent if the function is a simple known polynomial?
I was playing around with the topic of gradient descent. I wrote a function that performs gradient descent on a degree-2 polynomial. While trying out what is the best "step size multiplier" (a.k.a. ...
2
votes
1
answer
265
views
REINFORCE with Baseline update rule
I was looking at the algorithm for REINFORCE with baseline from the book 'Introduction to Reinforcement Learning' by Sutton:
I do not quite understand the update rule for $w$:
$w = w + \alpha \...
3
votes
1
answer
346
views
What do you mean by "updating based on a training example/batch" in Gradient Descent?
My understanding is this: When doing Stochastic Gradient Descent over a neural network, in every epoch, we run $n$ iterations (where the dataset has $n$ training examples) and in every iteration, we ...
2
votes
1
answer
110
views
Can you explain Hinton's comment "Rprop is equivalent to using the gradient, but also dividing by the size of the gradient"?
Been reviewing some old foundational material and ran into this comment by Hinton on Rprop in his old Coursera class:
Rprop is equivalent to using the gradient, but also dividing by the
size of the ...
1
vote
2
answers
66
views
Why use learning rate schedules if weight updates automatically decrease when approaching a local optimum?
Andrew Ng said in his slide that:
However, there are numerous types of 'learning rate schedules' in TensorFlow that change the learning rate profile as training progresses.
If it's true that these ...
1
vote
1
answer
104
views
A Feedforward Neural Network (FNN) implemented with RMSProp optimization is exhibiting a tendency to overclassify instances into one particular class
I'm coding an FNN in Rust using the nalgebra crate. I coded the backpropagation based on this article from Brilliant (the link directly highlights the formulas' section I).
The issue
My network tends ...
0
votes
1
answer
122
views
Gradient: any resource on how to understand everything about it?
I have read some resources about AI, and they all speak about the gradient.
Is there any book focused on this? Maybe with tons of images / diagrams?
Cheers
0
votes
1
answer
57
views
What are the differences between loss surfaces that "derive" from different observations?
If I understand correctly, each observation within a dataset creates a different loss surface on which we want to find the global minimum.
How different are those surfaces from one another?
Would it be ...
1
vote
0
answers
75
views
Can gradient descent cause loss to increase in some situations?
Is a gradient descent step always supposed to decrease loss? I can think of a situation where it would seem that gradient descent would increase loss, but maybe I am misunderstanding a part of ...
2
votes
2
answers
122
views
Is there a resource that offers a detailed overview of the gradient flow?
Understanding the concept of "Gradient Flow" can be quite difficult as there is a lack of widely recognized and clearly defined resources that provide a comprehensive explanation. Although ...
0
votes
1
answer
215
views
Is there a recommended resource that can provide a detailed overview of the gradient norm?
When it comes to the concept of "Gradient Norm," it can be challenging to find a widely recognized and clearly defined resource that offers a comprehensive explanation. While many search ...
11
votes
1
answer
11k
views
Why use ReLU over Leaky ReLU?
From my understanding a leaky ReLU attempts to address issues of vanishing gradients and nonzero-centeredness by keeping neurons that fire with a negative value alive.
With just this info to go off of,...
4
votes
1
answer
2k
views
What is the best way to combine or weight multiple losses with gradient descent?
I am optimizing a neural network with Adam using 3 different losses. Their scale is very different, and the current method is to either sum the losses and clip the gradient or to manually weight them ...
2
votes
1
answer
682
views
What is the justification for this approach of clipping elementwise?
I'm new to the field of AI (though I have a background in mathematics).
As I was going through some documents, I read that there is a form of gradient clipping where the elements of the gradient that ...
2
votes
3
answers
603
views
How does gradient descent work with ReLU if weights are negative?
How does gradient descent work with ReLU, imagine the weights are quite negative and so our "prediction" is 0, then not much is learned. Is there a risk that training gets stuck when weights ...
1
vote
1
answer
2k
views
Why use gradient accumulation?
I know that gradient accumulation is (1) a way to reduce memory usage while still enabling the machine to fit a large dataset (2) reducing the noise of the gradient compared to SGD, and thus smoothing ...
0
votes
1
answer
522
views
Single Layer Perceptron Backpropagation: How to compute the effect of the net value on the output?
Assuming a single perceptron (see figure), I have found two versions of how to use backpropagation to update the weights. The perceptron is split in two, so we see the weighted sum on the left (the ...
2
votes
2
answers
283
views
How are gradients of individual layers computed?
I have been reading some papers recently (example: https://arxiv.org/pdf/2012.00363.pdf) which seem to be training individual layers of, say, a transformer, holding the rest of the model frozen/...
0
votes
1
answer
61
views
What are your "current parameters" in Minibatch Stochastic Gradient Descent?
I was reading a book on Deep Learning when I came across a line, more like a few words that didn't make apparent sense.
Thus, we will often settle for sampling a random minibatch of examples every ...
1
vote
2
answers
110
views
Can I minimize a mysterious function by running gradient descent on its neural net approximations?
So I have this function, let's call it $F:[0,1]^n \rightarrow \mathbb{R}$, and say $10 \le n \le 100$. I want to find some $x_0 \in [0,1]^n$ such that $F(x_0)$ is as small as possible. I don't think ...
0
votes
1
answer
149
views
During batch normalization, is the mini-batch gone through twice, once to calculate the mean and variance and then again to normalize?
I am asking this question because while designing my own model, I had repeated gradient explosion issues, so I wanted to try batch normalization. I really want to understand the details and math ...
0
votes
1
answer
271
views
Numerical problems with gradient descent
I'm trying to implement a simple neural network for classification (multi-class) as an exercise (written in C). During gradient descent, the weights and biases quickly get out of control and the ...
2
votes
0
answers
58
views
Can the objective function and gradient be unbounded in reinforcement learning?
I'm looking at an example where they define a policy $\pi_\theta(a_t|s_t)\sim \mathcal{N}(ks_t, \sigma)$, where $a_t$ and $s_t$ are action and state, while $\theta=(k,\sigma)$ are the parameters of ...
1
vote
1
answer
466
views
Why do we use gradient descent to minimize the loss function?
The purpose of training neural networks is to minimize a loss function; in this process we usually use the gradient descent method.
But in calculus, if we want to find the global minimum of a ...