How does PyTorch's SGD work?

I recently learned that SGD does not normally converge if the learning rate is fixed.
I wondered how PyTorch handles the learning rate, then, when using

optimizer = optim.SGD()

as the optimizer, since it seems to require no parameters that affect the learning rate. Is momentum actually enough to guarantee convergence? Or does the learning rate get adjusted behind the scenes?

Hi Florian!

SGD converges just fine with a fixed learning rate. Here is a simple example
script with SGD (with a fixed learning rate and no momentum):

import torch
print(torch.__version__)

x = torch.ones(1, requires_grad=True)   # the single trainable parameter
t = torch.tensor([16.0])                # the target value
opt = torch.optim.SGD([x], lr=0.01)     # plain SGD: fixed learning rate, no momentum
loss_fn = torch.nn.MSELoss()

for i in range(20):
    opt.zero_grad()
    p = x**2                            # "model" is x**2
    loss = loss_fn(p, t)
    loss.backward()
    opt.step()
    print(x, loss)

And here is its output:

1.13.0
tensor([1.6000], requires_grad=True) tensor(225., grad_fn=<MseLossBackward0>)
tensor([2.4602], requires_grad=True) tensor(180.6336, grad_fn=<MseLossBackward0>)
tensor([3.4391], requires_grad=True) tensor(98.9550, grad_fn=<MseLossBackward0>)
tensor([4.0131], requires_grad=True) tensor(17.4123, grad_fn=<MseLossBackward0>)
tensor([3.9963], requires_grad=True) tensor(0.0110, grad_fn=<MseLossBackward0>)
tensor([4.0010], requires_grad=True) tensor(0.0009, grad_fn=<MseLossBackward0>)
tensor([3.9997], requires_grad=True) tensor(6.9633e-05, grad_fn=<MseLossBackward0>)
tensor([4.0001], requires_grad=True) tensor(5.4771e-06, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(4.3050e-07, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(3.3528e-08, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(2.4593e-09, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(1.7826e-10, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(1.4552e-11, grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(3.6380e-12, grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)

It is true that complex models can converge (very) slowly and that training
can sometimes be improved by varying the learning rate (e.g., “cycling” it)
or by using momentum or fancier optimizers.
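
For concreteness, here is a minimal sketch of what that can look like in PyTorch, reusing the toy problem above. The lr, momentum, and scheduler settings are purely illustrative choices of mine, not values from the original example:

import torch

x = torch.ones(1, requires_grad=True)
t = torch.tensor([16.0])
# SGD with momentum, plus a scheduler that varies the learning rate over time
opt = torch.optim.SGD([x], lr=0.05, momentum=0.9)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)  # halve lr every 10 steps
loss_fn = torch.nn.MSELoss()

for i in range(30):
    opt.zero_grad()
    loss = loss_fn(x**2, t)
    loss.backward()
    opt.step()
    sched.step()   # adjust the learning rate according to the schedule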

It is also possible (but, in my experience, unlikely) for training of a model
to converge to a true local minimum, rather than to the “correct” solution
of a lower local minimum or the actual global minimum. Increasing the
learning rate a lot so that a large optimization step jumps out of the local
minimum can get training to progress further (but that’s kind of like whacking
your starter motor with a sledgehammer to get your car started …).
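
If you did want to try that, one way (just a sketch, reusing the optimizer opt from the example above) is to change the lr entry in the optimizer's param_groups by hand:

# temporarily use a much larger learning rate to try to jump out of the minimum
for g in opt.param_groups:
    g['lr'] = 1.0
# ... take a few optimizer steps ...
for g in opt.param_groups:
    g['lr'] = 0.01   # then restore the original learning rate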

Best.

K. Frank

Hi Frank,

Thanks for your extensive response and explanation. However, your example does not really convince me. How does it differ from plain gradient descent? I don't see where the stochastic part comes into play, since you are not really dealing with different samples that SGD could choose from.

Correct me if I am wrong though!

The “stochastic” part of SGD comes from computing the gradient on mini-batches of the dataset, whereas (full-batch) gradient descent computes the gradient over the entire dataset.
This forum post might be helpful.
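
To make that concrete, here is a minimal, purely illustrative sketch (not taken from the linked post) of where the randomness enters: each update uses the gradient of the loss on a small random mini-batch rather than on the whole dataset:

import torch

# toy dataset: 100 (input, target) pairs for a linear model
X = torch.randn(100, 1)
Y = 3.0 * X + 1.0

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.1)
loss_fn = torch.nn.MSELoss()

for step in range(200):
    idx = torch.randint(0, 100, (10,))   # pick a random mini-batch of 10 samples
    pred = X[idx] * w + b
    loss = loss_fn(pred, Y[idx])         # loss (and gradient) on this batch only
    opt.zero_grad()
    loss.backward()
    opt.step()                           # update direction is "noisy" from step to step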