Hi Florian!
SGD converges just fine with a fixed learning rate. Here is a simple example
script that uses SGD (with a fixed learning rate and no momentum):
import torch
print(torch.__version__)

x = torch.ones(1, requires_grad=True)
t = torch.tensor([16.0])
opt = torch.optim.SGD([x], lr=0.01)   # plain SGD, fixed learning rate, no momentum
loss_fn = torch.nn.MSELoss()

for i in range(20):
    opt.zero_grad()
    p = x**2                          # "model" is x**2
    loss = loss_fn(p, t)
    loss.backward()
    opt.step()
    print(x, loss)
And here is its output:
1.13.0
tensor([1.6000], requires_grad=True) tensor(225., grad_fn=<MseLossBackward0>)
tensor([2.4602], requires_grad=True) tensor(180.6336, grad_fn=<MseLossBackward0>)
tensor([3.4391], requires_grad=True) tensor(98.9550, grad_fn=<MseLossBackward0>)
tensor([4.0131], requires_grad=True) tensor(17.4123, grad_fn=<MseLossBackward0>)
tensor([3.9963], requires_grad=True) tensor(0.0110, grad_fn=<MseLossBackward0>)
tensor([4.0010], requires_grad=True) tensor(0.0009, grad_fn=<MseLossBackward0>)
tensor([3.9997], requires_grad=True) tensor(6.9633e-05, grad_fn=<MseLossBackward0>)
tensor([4.0001], requires_grad=True) tensor(5.4771e-06, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(4.3050e-07, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(3.3528e-08, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(2.4593e-09, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(1.7826e-10, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(1.4552e-11, grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(3.6380e-12, grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
It is true that complex models can converge (very) slowly and that training
can sometimes be improved by varying the learning rate (e.g., “cycling” it)
or by using momentum or fancier optimizers.
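As an illustration, here is a minimal sketch of the same x**2 toy problem, but
with momentum and a cyclical learning rate. The CyclicLR schedule and the
specific hyperparameter values are just assumptions for illustration, not a
recommendation:

import torch

# same x**2 toy "model" as above, but with momentum and a cyclical
# learning rate; base_lr / max_lr / step_size_up are made-up values
x = torch.ones(1, requires_grad=True)
t = torch.tensor([16.0])

opt = torch.optim.SGD([x], lr=0.01, momentum=0.9)
sched = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=0.001, max_lr=0.01, step_size_up=5)
loss_fn = torch.nn.MSELoss()

for i in range(20):
    opt.zero_grad()
    loss = loss_fn(x**2, t)
    loss.backward()
    opt.step()
    sched.step()   # advance the cyclical learning-rate schedule once per step
    print(x, loss)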
It is also possible (but, in my experience, unlikely) for the training of a model
to converge to a true local minimum, rather than to the “correct” solution
of a lower local minimum or the actual global minimum. Increasing the
learning rate a lot so that a large optimization step jumps out of the local
minimum can get training to progress further (but that’s kind of like whacking
your starter motor with a sledgehammer to get your car started …).
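If you do want to try the sledgehammer, one way is to temporarily raise the
learning rate through the optimizer's param_groups when the loss looks stuck.
This sketch reuses the toy problem above purely to show the mechanism; the
plateau test and the boosted learning rate are made-up numbers:

import torch

x = torch.ones(1, requires_grad=True)
t = torch.tensor([16.0])
opt = torch.optim.SGD([x], lr=0.01)
loss_fn = torch.nn.MSELoss()

prev_loss = float('inf')
for i in range(100):
    opt.zero_grad()
    loss = loss_fn(x**2, t)
    loss.backward()
    # crude heuristic: if the loss has plateaued but is still large,
    # take this step with a much bigger learning rate (numbers made up)
    stuck = abs(prev_loss - loss.item()) < 1.0e-6 and loss.item() > 1.0e-3
    for group in opt.param_groups:
        group['lr'] = 1.0 if stuck else 0.01
    opt.step()
    prev_loss = loss.item()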
Best.
K. Frank