
Questions tagged [gradient-descent]

For questions about gradient descent, a method for finding the optimal parameters of a parameterized function by minimizing another function, often called the loss or error function. It iteratively descends the loss surface toward a minimum by adjusting each parameter by the product of a learning rate and the corresponding partial derivative of the loss (the components of the gradient).
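As a rough illustration of the update described above, here is a minimal sketch in plain Python. The function names (`gradient_descent`, `loss`, `loss_grad`), the quadratic example loss, the learning rate of 0.1, and the step count are all assumptions chosen for illustration, not part of the tag description.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly move the parameters against the gradient of the loss.

    grad: hypothetical callable returning the list of partial derivatives at theta.
    theta: list of starting parameter values.
    """
    for _ in range(steps):
        g = grad(theta)
        # Each parameter changes by (learning rate) * (its partial derivative).
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta


def loss(theta):
    # Assumed example loss: a convex bowl with its minimum at (3, -1).
    x, y = theta
    return (x - 3.0) ** 2 + (y + 1.0) ** 2


def loss_grad(theta):
    # Partial derivatives of the example loss with respect to x and y.
    x, y = theta
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


theta_min = gradient_descent(loss_grad, [0.0, 0.0])
print(theta_min, loss(theta_min))
```

On this convex example the iterates approach the minimizer (3, -1). On non-convex losses, such as those of neural networks, the same loop may settle in a local minimum or a saddle region instead.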

0 votes
1 answer
47 views

I got the following problem in a Computational Intelligence course exam. Analyze the following formulas for training of an MLP as an alternative training algorithm for MLPs. Tell the pros and cons of ...
Iman ghader's user avatar
1 vote
0 answers
75 views

Note up front: Please don’t confuse my current question with the well-known issue of noisy or varying gradient directions in stochastic gradient descent due to batch sampling. I’m aware of that and ...
Igor's user avatar
  • 325
0 votes
0 answers
28 views

I have read through the PPO paper by Schulman et al. (literally line by line) and reviewed related posts on AI and Stack Overflow. I am missing something and do not understand how the gradient ascent ...
Sohail Shaikh's user avatar
3 votes
1 answer
125 views

(Py)torch has a quite sophisticated autograd system. Essentially, it tracks which tensor was built from which one. That is very fine, if it can be applied in the problem. However, in the case of my ...
peterh's user avatar
  • 245
0 votes
0 answers
32 views

I'm faced with a problem where, as the title says, I'm having trouble with the torch package's built-in automatic differentiation algorithms (or my usage?). I think it was meant to be used on mini-...
Nomi Mino's user avatar
1 vote
1 answer
44 views

I am learning a linear regression model based on this tutorial. Following the example provided in the tutorial, it works fine with mini-batch stochastic gradient descent. ...
hguser's user avatar
  • 101
2 votes
1 answer
132 views

I am learning about the double descent phenomenon from here: https://www.di.ens.fr/~fbach/learning_theory_class/lecture9.pdf. I was asking myself: When training a system, how can we know in which regime ...
Thomas's user avatar
  • 265
3 votes
1 answer
106 views

So, deep learning models are great at learning complex, non-linear patterns and seem to handle noise just fine. But under the hood, they rely on IEEE754 floating-point numbers, which can lose ...
Muhammad Ikhwan Perwira's user avatar
1 vote
1 answer
55 views

In gradient descent for neural networks, we optimize over a loss surface defined by our loss function L(W) where W represents the network weights. However, since there are infinitely many possible ...
semahaissa's user avatar
5 votes
1 answer
326 views

What is grad_fn for a non-differentiable function like slicing (grad_fn=<SliceBackward0>), ...
Geremia's user avatar
  • 577
2 votes
1 answer
77 views

Does it make sense for a computational graph to have entirely non-differentiable functions? For example, PyTorch can handle non-differentiable functions and mark outputs as non-differentiable, but I'm ...
Geremia's user avatar
  • 577
3 votes
2 answers
152 views

Do computational graphs predate the era of machine learning? If so, who first devised the idea of a computational graph?
Geremia's user avatar
  • 577
0 votes
1 answer
46 views

Suppose the hardware constraint is not a problem anymore, so that the quantum computer is everywhere. If we define a neural network model that has many params, traditionally (using gradient descent) ...
Muhammad Ikhwan Perwira's user avatar
2 votes
1 answer
65 views

I have a scenario where I am trying to optimize a vector of D dimensions. Every component of the vector is dependent on other components according to a function such as: summation over (i,j): (1-e(x_i)(...
Darkmoon Chief's user avatar
1 vote
1 answer
61 views

A = (m x n) - input B = (n x k) - weight output = A @ B = (m x k) ...
Тима 's user avatar
1 vote
1 answer
67 views

Some notations for the question: $w_{ij}^l$ is the weight connecting ith neuron of the layer l to the jth neuron of the layer $l-1$. $z_i^l$ is the activation of ith neuron in the layer l (for ...
Vedant Yadav's user avatar
1 vote
1 answer
81 views

I am trying to fit a Chapman-Richards growth curve: $$ B = A*(1-e^{-kt}) $$ Where B is the biomass of a forest, A is the asymptote, k is the growth rate, and t is forest age. I expect the growth rate ...
Ana Catarina Vitorino's user avatar
0 votes
1 answer
52 views

Imagine an input vector a = {a1,a2,a3} and z = softmax(a) = {z1,z2,z3}. So, we expect that the gradient of z with respect to a would have the same shape as the vector a (so we can make a gradient step: a = a - ...
Тима 's user avatar
0 votes
1 answer
113 views

I'm exploring machine learning and currently studying calculus, specifically gradient descent. To practice, I am using the function: $$ f(x,y)= x^{2}y $$ I have implemented the gradient descent ...
Ian Aragão's user avatar
2 votes
3 answers
2k views

I am currently following course 1 of Andrew Ng's Machine Learning Specialization. I understood we need a convex cost function to reach the global minimum of the loss. But that means the gradient will ...
Namirah Rasul's user avatar
0 votes
0 answers
177 views

I was looking into the loss function in t5x here and see there is a z-loss added to the typical log loss definition. The only paper I could surface on this was https://arxiv.org/abs/1604.08859, but I ...
Jacob B's user avatar
  • 279
3 votes
1 answer
54 views

I'm seeing different manners to define momentum, I'm not sure if there is significant difference or not. From my thinking, they seem to do a similar thing mathematically and in practice but I'm ...
tensor's user avatar
  • 125
0 votes
0 answers
152 views

I have been trying to train a model with 0.001 learning rate. I tried regression techniques, early stopping and lr manipulations within epochs. But nothing felt right even though after numerous tries ...
Yigithan Sever's user avatar
0 votes
1 answer
67 views

While doing Andrew Ng's unsupervised learning specialization, I came across this algorithm for collaborative filtering: here the Xi refers to the feature vector of objects (ex: action in movies, ...
SRAVAN KOTTA's user avatar
0 votes
0 answers
31 views

I am training a transformer based neural network and the validation loss is not decreasing, but the training loss does decrease. I am wondering if it's possible to debug or change the architecture ...
JobHunter69's user avatar
1 vote
1 answer
79 views

I'm a new student in AI, currently learning linear regression. I used the california housing dataset for doing my experiments. My goal is to predict the 'population' column based on the 'total_rooms' ...
Jahid Chowdhury Choton's user avatar
1 vote
1 answer
127 views

I was playing around with the topic of gradient descent. I wrote a function that performs gradient descent on a degree-2 polynomial. While trying to find the best "step size multiplier" (a.k.a. &...
Ababababa's user avatar
  • 113
2 votes
1 answer
265 views

I was looking at the algorithm for REINFORCE with baseline from the Book 'Introduction to Reinforcement Learning' from Sutton: I do not quite understand the update rule for $w$: $w = w + \alpha \...
kklaw's user avatar
  • 195
3 votes
1 answer
346 views

My understanding is this: When doing Stochastic Gradient Descent over a neural network, in every epoch, we run $n$ iterations (where the dataset has $n$ training examples) and in every iteration, we ...
insipidintegrator's user avatar
2 votes
1 answer
110 views

Been reviewing some old foundational material and ran into this comment by Hinton on Rprop in his old Coursera class: Rprop is equivalent to using the gradient, but also dividing by the size of the ...
eof's user avatar
  • 121
1 vote
2 answers
66 views

Andrew Ng said in his slide that: However, there are numerous types of 'learning rate schedules' in TensorFlow that change the learning rate profile as training progresses. If it's true that these ...
Wong's user avatar
  • 11
1 vote
1 answer
104 views

I'm coding an FNN in Rust using the nalgebra crate. I coded the backpropagation based on this article from Brilliant (the link directly highlights the formulas' section I). The issue My network tends ...
Evry's user avatar
  • 13
0 votes
1 answer
122 views

I have read some resources about AI, and they all speak about the gradient. Is there any book focused on this? maybe with tons of images / diagrams? Cheers
zerunio's user avatar
0 votes
1 answer
57 views

If I understand correctly, each observation within a dataset creates a different loss surface on which we want to find the global minimum. How different are those surfaces from one another? Would it be ...
Igor's user avatar
  • 325
1 vote
0 answers
75 views

Is a gradient descent step always supposed to decrease loss? I can think of a situation where it would seem that gradient descent would increase loss, but maybe I am misunderstanding a part of ...
Mike Levi's user avatar
2 votes
2 answers
122 views

Understanding the concept of "Gradient Flow" can be quite difficult as there is a lack of widely recognized and clearly defined resources that provide a comprehensive explanation. Although ...
v1998199904's user avatar
0 votes
1 answer
215 views

When it comes to the concept of "Gradient Norm," it can be challenging to find a widely recognized and clearly defined resource that offers a comprehensive explanation. While many search ...
StudentV's user avatar
11 votes
1 answer
11k views

From my understanding a leaky ReLU attempts to address issues of vanishing gradients and nonzero-centeredness by keeping neurons that fire with a negative value alive. With just this info to go off of,...
John Brown's user avatar
4 votes
1 answer
2k views

I am optimizing a neural network with Adam using 3 different losses. Their scale is very different, and the current method is to either sum the losses and clip the gradient or to manually weight them ...
Simon's user avatar
  • 263
2 votes
1 answer
682 views

I'm new to the field of AI (though I have a background in mathematics). As I was going through some documents, I read that there is a form of gradient clipping where the elements of the gradient that ...
Ukn0wn's user avatar
  • 21
2 votes
3 answers
603 views

How does gradient descent work with ReLU? Imagine the weights are quite negative and so our "prediction" is 0; then not much is learned. Is there a risk that training gets stuck when weights ...
Dirk N's user avatar
  • 158
1 vote
1 answer
2k views

I know that gradient accumulation is (1) a way to reduce memory usage while still enabling the machine to fit a large dataset (2) reducing the noise of the gradient compared to SGD, and thus smoothing ...
Cyrus's user avatar
  • 111
0 votes
1 answer
522 views

Assuming a single perceptron (see figure), I have found two versions of how to use backpropagation to update the weights. The perceptron is split in two, so we see the weighted sum on the left (the ...
HTH's user avatar
  • 1
2 votes
2 answers
283 views

I have been reading some papers recently (example: https://arxiv.org/pdf/2012.00363.pdf) which seem to be training individual layers of, say, a transformer, holding the rest of the model frozen/...
nlp4892's user avatar
  • 21
0 votes
1 answer
61 views

I was reading a book on Deep Learning when I came across a line, more like a few words that didn't make apparent sense. Thus, we will often settle for sampling a random minibatch of examples every ...
HarshDarji's user avatar
1 vote
2 answers
110 views

So I have this function, let's call it $F:[0,1]^n \rightarrow \mathbb{R}$, and say $10 \le n \le 100$. I want to find some $x_0 \in [0,1]^n$ such that $F(x_0)$ is as small as possible. I don't think ...
Vladimir Zolotov's user avatar
0 votes
1 answer
149 views

I am asking this question because while designing my own model, I had repeated gradient explosion issues, so I wanted to try batch normalization. I really want to understand the details and math ...
liyu zerihun's user avatar
0 votes
1 answer
271 views

I'm trying to implement a simple neural network for classification (multi-class) as an exercise (written in C). During gradient descent, the weights and biases quickly get out of control and the ...
martinkunev's user avatar
2 votes
0 answers
58 views

I'm looking at an example where they define a policy $\pi_\theta(a_t|s_t)\sim \mathcal{N}(ks_t, \sigma)$, where $a_t$ and $s_t$ are action and state, while $\theta=(k,\sigma)$ are the parameters of ...
pippo's user avatar
  • 41
1 vote
1 answer
466 views

The purpose of training neural networks is to minimize a loss function, and in this process we usually use the gradient descent method. But in calculus, if we want to find the global minimum of a ...
Proton's user avatar
  • 111
