Hi Florian!

`SGD` converges just fine with a fixed learning rate. Here is a simple example script with `SGD` (with a fixed learning rate and no momentum):

```
import torch
print (torch.__version__)

x = torch.ones (1, requires_grad = True)    # trainable parameter, initialized to 1.0
t = torch.tensor ([16.0])                   # target: we want x**2 == 16.0, i.e., x == 4.0
opt = torch.optim.SGD ([x], lr = 0.01)      # plain SGD, fixed learning rate, no momentum
loss_fn = torch.nn.MSELoss()

for i in range (20):
    opt.zero_grad()
    p = x**2                                # "model" is x**2
    loss = loss_fn (p, t)
    loss.backward()
    opt.step()
    print (x, loss)
```

And here is its output:

```
1.13.0
tensor([1.6000], requires_grad=True) tensor(225., grad_fn=<MseLossBackward0>)
tensor([2.4602], requires_grad=True) tensor(180.6336, grad_fn=<MseLossBackward0>)
tensor([3.4391], requires_grad=True) tensor(98.9550, grad_fn=<MseLossBackward0>)
tensor([4.0131], requires_grad=True) tensor(17.4123, grad_fn=<MseLossBackward0>)
tensor([3.9963], requires_grad=True) tensor(0.0110, grad_fn=<MseLossBackward0>)
tensor([4.0010], requires_grad=True) tensor(0.0009, grad_fn=<MseLossBackward0>)
tensor([3.9997], requires_grad=True) tensor(6.9633e-05, grad_fn=<MseLossBackward0>)
tensor([4.0001], requires_grad=True) tensor(5.4771e-06, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(4.3050e-07, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(3.3528e-08, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(2.4593e-09, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(1.7826e-10, grad_fn=<MseLossBackward0>)
tensor([4.0000], requires_grad=True) tensor(1.4552e-11, grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(3.6380e-12, grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
tensor([4.], requires_grad=True) tensor(0., grad_fn=<MseLossBackward0>)
```

It is true that complex models can converge (very) slowly and that training can sometimes be improved by varying the learning rate (e.g., “cycling” it) or by using momentum or fancier optimizers.
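If you want to play with that on the toy example above, here is a minimal sketch using `SGD` with momentum together with a `CyclicLR` learning-rate scheduler. (The particular `momentum`, `base_lr`, `max_lr`, and `step_size_up` values are made-up numbers for illustration; whether they actually help depends on the problem.)

```
import torch

x = torch.ones (1, requires_grad = True)
t = torch.tensor ([16.0])
loss_fn = torch.nn.MSELoss()

# same toy problem, but with momentum and a "cycled" learning rate
opt = torch.optim.SGD ([x], lr = 0.01, momentum = 0.9)
sched = torch.optim.lr_scheduler.CyclicLR (opt, base_lr = 0.001, max_lr = 0.02, step_size_up = 5)

for i in range (20):
    opt.zero_grad()
    loss = loss_fn (x**2, t)
    loss.backward()
    opt.step()
    sched.step()   # advance the learning-rate cycle once per optimization step
```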

It is also possible (but, in my experience, unlikely) for a model’s training to converge to a *true local minimum*, rather than to the “correct” solution of a lower local minimum or the actual global minimum. Increasing the learning rate *a lot* so that a *large* optimization step jumps out of the local minimum can get training to progress further (but that’s kind of like whacking your starter motor with a sledgehammer to get your car started …).
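For what it’s worth, if you ever do want to swing that sledgehammer, you don’t have to rebuild the optimizer; you can change the learning rate of an existing optimizer in place through its `param_groups`. Here is a minimal sketch (`bump_lr` is just a made-up helper name, and the factors are arbitrary):

```
import torch

def bump_lr (opt, factor):
    # multiply the learning rate of every param group by factor
    for g in opt.param_groups:
        g['lr'] *= factor

x = torch.ones (1, requires_grad = True)
opt = torch.optim.SGD ([x], lr = 0.01)

bump_lr (opt, 100.0)                  # one big "sledgehammer" jump ...
print (opt.param_groups[0]['lr'])     # 1.0

# ... take a step or two at the large rate, then put it back
bump_lr (opt, 0.01)
print (opt.param_groups[0]['lr'])     # back to 0.01
```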

Best.

K. Frank