PyTorch doesn't seem to be optimizing correctly

Question

I have posted this question on Data Science StackExchange site since StackOverflow does not support LaTeX. Linking it here because this site is probably more appropriate.

The question with correctly rendered LaTeX is here: https://datascience.stackexchange.com/questions/48062/pytorch-does-not-seem-to-be-optimizing-correctly

The idea is that I am considering sums of sine waves with different phases. The waves are sampled with some sample rate s in the interval [0, 2pi]. I need to select phases in such a way, that the sum of the waves at any sample point is minimized.

Below is the Python code. Optimization does not seem to be computed correctly.

import numpy as np
import torch

def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-3

    theta = torch.zeros([n, 1], requires_grad=True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    T = t + theta

    for jj in range(nsteps):
        loss = T.sin().sum(0).pow(2).sum() / s
        loss.backward()
        theta.data -= learning_rate * theta.grad.data

    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())

Below is a sample output.

phaseOptimize(5, nsteps=100)


Optimal theta: 

 tensor([[1.2812e-07],
        [1.2812e-07],
        [1.2812e-07],
        [1.2812e-07],
        [1.2812e-07]], requires_grad=True)


Maximum value: 5.0

I am assuming this has something to do with broadcasting in

T = t + theta

and/or the way I am computing the loss function.

One way to verify that optimization is incorrect, is to simply evaluate the loss function at random values for the array $\theta_1, \dots, \theta_n$, say uniformly distributed in $[0, 2\pi]$. The maximum value in this case is almost always much lower than the maximum value reported by phaseOptimize(). Much easier in fact is to consider the case with $n = 2$, and simply evaluate at $\theta_1 = 0$ and $\theta_2 = \pi$. In that case we get:

phaseOptimize(2, nsteps=100)

Optimal theta: 

 tensor([[2.8599e-08],
        [2.8599e-08]])


Maximum value: 2.0

On the other hand,

theta = torch.FloatTensor([[0], [np.pi]])
l = torch.linspace(0, 2 * np.pi, 48000)
t = torch.stack([l] * 2)
T = t + theta

T.sin().sum(0).abs().max().item()

produces

3.2782554626464844e-07

Sergii Dymchenko · Accepted Answer · 2019-03-27 09:18:28Z

2

You have to move computing T inside the loop, or it will always have the same constant value, thus constant loss.

Another thing is to initialize theta to different values at indices, otherwise because of the symmetric nature of the problem the gradient is the same for every index.

Another thing is that you need to zero gradient, because backward just accumulates them.

This seems to work:

def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-1

    theta = torch.zeros([n, 1], requires_grad=True)
    theta.data[0][0] = 1
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)

    for jj in range(nsteps):
        T = t + theta
        loss = T.sin().sum(0).pow(2).sum() / s
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
        theta.grad.zero_()

answered Mar 27, 2019 at 9:18

Sergii Dymchenko

7,2671 gold badge25 silver badges49 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Jatentaki · Accepted Answer · 2019-03-27 09:19:39Z

You're being bitten by both PyTorch and math. Firstly, you need to

Zero out the gradient by setting theta.grad = None before each backward step. Otherwise the gradients accumulate instead of overwriting the previous ones
You need to recalculate T at each step. PyTorch is not symbolic, unlike TensorFlow and T = t + theta means "T equals the sum of current t and current theta" and not "T equals the sum of t and theta, whatever their values may be at any time in the future".

With those fixes you get the following code:

def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-3

    theta = torch.zeros(n, 1, requires_grad=True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    T = t + theta

    for jj in range(nsteps):
        T = t + theta
        loss = T.sin().sum(0).pow(2).sum() / s
        theta.grad = None
        loss.backward()
        theta.data -= learning_rate * theta.grad.data

    T = t + theta

    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())

which will still not work as you expect because of math.

One can easily see that the minimum to your loss function is when theta are also uniformly spaced over [0, 2pi). The problem is that you are initializing your parameters as torch.zeros, which leads to all those values being equal (this is the polar opposite of equispaced!). Since your loss function is symmetrical with respect to permutations of theta, the computed gradients are equal and the gradient descent algorithm can never "differentiate them". In more mathematical terms, you're unlucky enough to initialize your algorithm exactly on a saddle point, so it cannot continue. If you add any noise, it will converge. For instance with

theta = torch.zeros(n, 1) + 0.001 * torch.randn(n, 1)
theta.requires_grad_(True)

Collectives™ on Stack Overflow

PyTorch doesn't seem to be optimizing correctly

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related