Newest 'gradient-descent' Questions

0 votes

1 answer

47 views

What are the pros and cons of this algorithm for training of an MLP?

I got the following problem in a Computational Intelligence course exam. Analyze the following formulas for training of an MLP as an alternative training algorithm for MLPs. Tell the pros and cons of ...

Iman ghader

1

asked Oct 8 at 20:44

1 vote

0 answers

75 views

Does using per-parameter adaptive learning rates (e.g. in Adam) change the direction of the gradient and break steepest descent?

Note up front: Please don’t confuse my current question with the well-known issue of noisy or varying gradient directions in stochastic gradient descent due to batch sampling. I’m aware of that and ...

Igor

325

asked Jul 29 at 14:28

0 votes

0 answers

28 views

Proximal Policy Optimization - how does gradient ascent work on theta?

I have read through the PPO paper by Schulman et al. (literally line by line) and reviewed related posts on AI and Stack Overflow. I am missing something and do not understand how the gradient ascent ...

Sohail Shaikh

13

asked Jul 10 at 22:04

3 votes

1 answer

125 views

Can torch use NN optimization algorithms other than gradient descent?

(Py)torch has quite a sophisticated autograd system. Essentially, it tracks which tensor was built from which one. That is fine if it can be applied to the problem. However, in the case of my ...

peterh

245

asked Jun 13 at 18:08

0 votes

0 answers

32 views

Torch gradient estimates disagreeing with analytic and perturbation approximated gradients

I'm faced with a problem where, as the title says, I'm having trouble with the torch package's built-in automatic differentiation algorithms (or my usage?). I think it was meant to be used on mini-...

Nomi Mino

1

asked Apr 19 at 14:54

1 vote

1 answer

44 views

Loss keeps increasing when using full-batch gradient descent

I am learning about linear regression models based on this tutorial. Following the example provided in the tutorial, it works fine with mini-batch stochastic gradient descent. ...

hguser

101

asked Apr 10 at 11:38

2 votes

1 answer

132 views

Learning curve behaviors across double descent regimes

I am learning about the double descent phenomenon from https://www.di.ens.fr/~fbach/learning_theory_class/lecture9.pdf and was asking myself: when training a system, how can we know in which regime ...

Thomas

265

asked Apr 4 at 7:40

3 votes

1 answer

106 views

Why doesn't deep learning use modular arithmetic like cryptography, even though both deal with non-linear functions?

So, deep learning models are great at learning complex, non-linear patterns and seem to handle noise just fine. But under the hood, they rely on IEEE754 floating-point numbers, which can lose ...

Muhammad Ikhwan Perwira

800

asked Mar 9 at 9:35

1 vote

1 answer

55 views

How can gradient descent optimize a loss surface that's never fully computed?

In gradient descent for neural networks, we optimize over a loss surface defined by our loss function L(W) where W represents the network weights. However, since there are infinitely many possible ...

semahaissa

11

asked Feb 15 at 8:11

5 votes

1 answer

326 views

Is PyTorch's `grad_fn` for a non-differentiable function that function's inverse?

What is grad_fn for a non-differentiable function like slicing (grad_fn=<SliceBackward0>), ...

Geremia

577

asked Feb 13 at 0:18

2 votes

1 answer

77 views

Does it make sense for a computational graph to have entirely non-differentiable functions?

Does it make sense for a computational graph to have entirely non-differentiable functions? For example, PyTorch can handle non-differentiable functions and mark outputs as non-differentiable, but I'm ...

Geremia

577

asked Feb 12 at 0:25

3 votes

2 answers

152 views

Do computational graphs predate the era of machine learning?

Do computational graphs predate the era of machine learning? If so, who first devised the idea of a computational graph?

Geremia

577

asked Feb 8 at 23:53

0 votes

1 answer

46 views

Is the global minimum loss always the best metric?

Suppose hardware constraints are no longer a problem, so that quantum computers are everywhere. If we define a neural network model that has many params, traditionally (using gradient descent) ...

Muhammad Ikhwan Perwira

800

asked Nov 27, 2024 at 20:53

2 votes

1 answer

65 views

Do we plug in the old values or the new values during the gradient descent update?

I have a scenario where I am trying to optimize a vector of D dimensions. Every component of the vector depends on the other components according to a function such as: summation over (i,j): (1-e(x_i)(...

Darkmoon Chief

31

asked Nov 5, 2024 at 10:07

1 vote

1 answer

61 views

How to normalize gradient value due to the batch size?

A = (m x n) - input, B = (n x k) - weight, output = A @ B = (m x k) ...

Тима

59

asked Nov 1, 2024 at 22:21

1 vote

1 answer

67 views

Gradient calculation in Backpropagation

Some notations for the question: $w_{ij}^l$ is the weight connecting ith neuron of the layer l to the jth neuron of the layer $l-1$. $z_i^l$ is the activation of ith neuron in the layer l (for ...

Vedant Yadav

13

asked Oct 25, 2024 at 15:13

1 vote

1 answer

81 views

Options for fitting a growth curve - process-based, hybrid, or neural networks?

I am trying to fit a Chapman-Richards growth curve: $$ B = A*(1-e^{-kt}) $$ Where B is the biomass of a forest, A is the asymptote, k is the growth rate, and t is forest age. I expect the growth rate ...

Ana Catarina Vitorino

13

asked Oct 25, 2024 at 1:13

0 votes

1 answer

52 views

Softmax gradient for automatic differentiation

Imagine an input vector a = {a1,a2,a3} and z = softmax(a) = {z1,z2,z3}. We expect that the gradient of z with respect to a would have the same shape as vector a (so we can make a gradient step: a = a - ...

Тима

59

asked Oct 23, 2024 at 11:58

0 votes

1 answer

113 views

Why is gradient clipping not preventing my gradient descent from going out of bounds?

I'm exploring machine learning and currently studying calculus, specifically gradient descent. To practice, I am using the function: $$ f(x,y)= x^{2}y $$ I have implemented the gradient descent ...

Ian Aragão

3

asked Jul 26, 2024 at 3:42

2 votes

3 answers

2k views

Why exactly do we need the learning rate in gradient descent?

I am currently following course 1 of Andrew Ng's Machine Learning Specialization. I understood we need a convex cost function to reach the global minimum of the loss. But that means the gradient will ...

Namirah Rasul

21

asked Jul 25, 2024 at 16:39

0 votes

0 answers

177 views

How is this z-loss implementation in t5x related to this paper's loss X?

I was looking into the loss function in t5x here and see there is a z-loss added to the typical log loss definition. The only paper I could surface on this was https://arxiv.org/abs/1604.08859, but I ...

Jacob B

279

asked Jul 10, 2024 at 1:11

3 votes

1 answer

54 views

Different Definitions of Momentum -- which one should I work with?

I'm seeing different ways to define momentum, and I'm not sure whether there is a significant difference. From my thinking, they seem to do a similar thing mathematically and in practice but I'm ...

tensor

125

asked Jul 3, 2024 at 16:22

0 votes

0 answers

152 views

Learning Rate greater than ~0.00005 significantly hinders model performance and increases loss

I have been trying to train a model with 0.001 learning rate. I tried regression techniques, early stopping and lr manipulations within epochs. But nothing felt right even though after numerous tries ...

Yigithan Sever

1

asked Jul 3, 2024 at 7:13

0 votes

1 answer

67 views

Collaborative filtering using linear regression

While doing Andrew Ng's unsupervised learning specialization, I came across this algorithm for collaborative filtering: here the Xi refers to the feature vector of objects (e.g., action in movies, ...

SRAVAN KOTTA

1

asked May 8, 2024 at 6:53

0 votes

0 answers

31 views

Is there any point in altering neural network architecture if validation loss does not decrease but training loss does?

I am training a transformer based neural network and the validation loss is not decreasing, but the training loss does decrease. I am wondering if it's possible to debug or change the architecture ...

JobHunter69

233

asked Apr 27, 2024 at 18:15

1 vote

1 answer

79 views

Why does the same learning rate for slope and intercept not work in linear regression?

I'm a new student in AI, currently learning linear regression. I used the California housing dataset for my experiments. My goal is to predict the 'population' column based on the 'total_rooms' ...

Jahid Chowdhury Choton

23

asked Apr 9, 2024 at 9:01

1 vote

1 answer

127 views

Is there a theoretical way to determine the best learning rate for gradient descent if the function is a simple known polynomial?

I was playing around with gradient descent and wrote a function that performs gradient descent on a degree-2 polynomial. While trying out what is the best "step size multiplier" (a.k.a. ...

Ababababa

113

asked Feb 19, 2024 at 14:55

2 votes

1 answer

265 views

REINFORCE with Baseline update rule

I was looking at the algorithm for REINFORCE with baseline from the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. I do not quite understand the update rule for $w$: $w = w + \alpha \...

kklaw

195

asked Feb 11, 2024 at 10:18

3 votes

1 answer

346 views

What do you mean by "updating based on a training example/batch" in Gradient Descent?

My understanding is this: When doing Stochastic Gradient Descent over a neural network, in every epoch, we run $n$ iterations (where the dataset has $n$ training examples) and in every iteration, we ...

insipidintegrator

143

asked Feb 5, 2024 at 5:44

2 votes

1 answer

110 views

Can you explain Hinton's comment "Rprop is equivalent to using the gradient, but also dividing by the size of the gradient"?

Been reviewing some old foundational material and ran into this comment by Hinton on Rprop in his old Coursera class: Rprop is equivalent to using the gradient, but also dividing by the size of the ...

eof

121

asked Feb 4, 2024 at 11:32

1 vote

2 answers

66 views

Why use learning rate schedules if weight updates automatically decrease when approaching a local optimum?

Andrew Ng said in his slide that: However, there are numerous types of 'learning rate schedules' in TensorFlow that change the learning rate profile as training progresses. If it's true that these ...

Wong

11

asked Jan 23, 2024 at 3:13

1 vote

1 answer

104 views

A Feedforward Neural Network (FNN) implemented with RMSProp optimization is exhibiting a tendency to overclassify instances into one particular class

I'm coding an FNN in Rust using the nalgebra crate. I coded the backpropagation based on this article from Brilliant (the link directly highlights the formulas' section I). The issue: my network tends ...

Evry

13

asked Nov 16, 2023 at 11:11

0 votes

1 answer

122 views

Gradient: any resource on how to understand everything about it?

I have read some resources about AI, and they all talk about the gradient. Is there any book focused on this, maybe with tons of images / diagrams? Cheers

zerunio

3

asked Sep 29, 2023 at 10:05

0 votes

1 answer

57 views

What are the differences between loss surfaces that "derive" from different observations?

If I understand correctly, each observation within a dataset creates a different loss surface on which we want to find the global minimum. How different are those surfaces from one another? Would it be ...

Igor

325

asked Aug 10, 2023 at 4:55

1 vote

0 answers

75 views

Can gradient descent cause loss to increase in some situations?

Is a gradient descent step always supposed to decrease loss? I can think of a situation where it would seem that gradient descent would increase loss but maybe I am misunderstanding a part of ...

Mike Levi

11

asked Jun 18, 2023 at 21:02

2 votes

2 answers

122 views

Is there a resource that offers a detailed overview of the gradient flow?

Understanding the concept of "Gradient Flow" can be quite difficult as there is a lack of widely recognized and clearly defined resources that provide a comprehensive explanation. Although ...

v1998199904

21

asked May 27, 2023 at 18:51

0 votes

1 answer

215 views

Is there a recommended resource that can provide a detailed overview of the gradient norm?

When it comes to the concept of "Gradient Norm," it can be challenging to find a widely recognized and clearly defined resource that offers a comprehensive explanation. While many search ...

StudentV

11

asked May 27, 2023 at 17:38

11 votes

1 answer

11k views

Why use ReLU over Leaky ReLU?

From my understanding a leaky ReLU attempts to address issues of vanishing gradients and nonzero-centeredness by keeping neurons that fire with a negative value alive. With just this info to go off of,...

John Brown

113

asked May 24, 2023 at 21:44

4 votes

1 answer

2k views

What is the best way to combine or weight multiple losses with gradient descent?

I am optimizing a neural network with Adam using 3 different losses. Their scale is very different, and the current method is to either sum the losses and clip the gradient or to manually weight them ...

Simon

263

asked May 24, 2023 at 17:29

2 votes

1 answer

682 views

What is the justification for this approach of clipping elementwise?

I'm new to the field of AI (though I have a background in mathematics). As I was going through some documents, I read that there is a form of gradient clipping where the elements of the gradient that ...

Ukn0wn

21

asked Apr 27, 2023 at 17:08

2 votes

3 answers

603 views

How does gradient descent work with ReLU if weights are negative?

How does gradient descent work with ReLU? Imagine the weights are quite negative and so our "prediction" is 0; then not much is learned. Is there a risk that training gets stuck when weights ...

Dirk N

158

asked Mar 30, 2023 at 20:04

1 vote

1 answer

2k views

Why use gradient accumulation?

I know that gradient accumulation is (1) a way to reduce memory usage while still enabling the machine to fit a large dataset and (2) a way to reduce the noise of the gradient compared to SGD, thus smoothing ...

Cyrus

111

asked Jan 25, 2023 at 22:46

0 votes

1 answer

522 views

Single Layer Perceptron Backpropagation: How to compute the effect of the net value on the output?

Assuming a single perceptron (see figure), I have found two versions of how to use backpropagation to update the weights. The perceptron is split in two, so we see the weighted sum on the left (the ...

HTH

1

asked Jan 20, 2023 at 12:41

2 votes

2 answers

283 views

How are gradients of individual layers computed?

I have been reading some papers recently (example: https://arxiv.org/pdf/2012.00363.pdf) which seem to be training individual layers of, say, a transformer, holding the rest of the model frozen/...

nlp4892

21

asked Nov 12, 2022 at 17:11

0 votes

1 answer

61 views

What are your "current parameters" in Minibatch Stochastic Gradient Descent?

I was reading a book on Deep Learning when I came across a line, or rather a few words, that didn't make apparent sense: Thus, we will often settle for sampling a random minibatch of examples every ...

HarshDarji

101

asked Oct 9, 2022 at 2:31

1 vote

2 answers

110 views

Can I minimize a mysterious function by running gradient descent on its neural net approximations?

So I have this function, let's call it $F:[0,1]^n \rightarrow \mathbb{R}$, and say $10 \le n \le 100$. I want to find some $x_0 \in [0,1]^n$ such that $F(x_0)$ is as small as possible. I don't think ...

Vladimir Zolotov

111

asked Sep 22, 2022 at 23:12

0 votes

1 answer

149 views

During batch normalization, is the mini-batch gone through twice: once to calculate the mean and variance, and then again to normalize?

I am asking this question because while designing my own model, I had repeated gradient explosion issues, so I wanted to try batch normalization. I really want to understand the details and math ...

liyu zerihun

1

asked Sep 5, 2022 at 10:15

0 votes

1 answer

271 views

Numerical problems with gradient descent

I'm trying to implement a simple neural network for classification (multi-class) as an exercise (written in C). During gradient descent, the weights and biases quickly get out of control and the ...

martinkunev

255

asked Aug 1, 2022 at 12:26

2 votes

0 answers

58 views

Can the objective function and gradient be unbounded in reinforcement learning?

I'm looking at an example where they define a policy $\pi_\theta(a_t|s_t)\sim \mathcal{N}(ks_t, \sigma)$, where $a_t$ and $s_t$ are action and state, while $\theta=(k,\sigma)$ are the parameters of ...

pippo

41

asked Jul 5, 2022 at 9:11

1 vote

1 answer

466 views

Why do we use gradient descent to minimize the loss function?

The purpose of training neural networks is to minimize a loss function, and in this process we usually use the gradient descent method. But in calculus, if we want to find the global minimum of a ...

Proton

111

asked Jul 5, 2022 at 2:41
