In short, I was working on some problems whose most degenerate forms can be linear. Hence I was able to reduce the non-converging cases to a very small linear regression problem that converges unreasonably slow with gradient descent.

I was under the impression that while solving linear optimization with gradient descent is not the most efficient way, it should nonetheless converge quite quickly and be a practical way to solve linear problems (so that non-linearities can be seamlessly added later). Among other things, linear regression is considered a standard introductory problem to gradient descent. Also many NNs are piece-wise linear. Now instead, I start to question the nature of my reality.

The problem is to minimize ||Ax-B||^2 (that is to solve Ax=B) like follows.

The loss starts at 100 and is expected to minimize to 0. Instead it converged impractically slow to be solvable with gradient descent.

```
import torch as t
A = t.tensor([
[-2.4969e+02, -4.1511e+00],
[-4.1511e+00, -2.0755e-01]])
B = t.tensor([-0., 10.])
#trivially solvable by lstsq
x_solved = t.linalg.lstsq(A,B)
print(x_solved)
#solution=tensor([ 1.2000, -72.1824])
print("check if Ax=B", A@x_solved.solution-B)
def forward(x_):
return (A@x_-B).pow(2).sum()
#sanity check with the lstsq solution
print("loss computed with the lstsq solution",forward(x_solved.solution))
x = t.zeros(2,requires_grad=True)
#learning_rate = 1e-7 #converging to 99.20282745361328 at T=1000000
#learning_rate = 1e-6 #converging to 92.60104370117188 at T=1000000
learning_rate = 1e-5 #converging to 46.44608688354492 at T=1000000
#learning_rate = 1.603e-5 # converging to 29.044937133789062 at T=1000000
#learning_rate = 1.604e-5 # diverging
#learning_rate = 1.605e-5 # inf
#learning_rate = 1.61e-5 # NaN
for T in range(1000001):
loss = forward(x)
if T % 100 == 0:
print(T, loss.item(),end='\r')
loss.backward()
with t.no_grad():
x -= learning_rate * x.grad
x.grad = None
print('converging to',loss.item(),f'at T={T} with lr={learning_rate}')
```

I have already gone to extra lengths finding a good learning rate - for normal “tuning” one would only try values such as 1e-5 or 2e-6 rather than pinning down multiple digits just below the point of divergence.

I have also tried unrolling the expression and ultimately computing the derivatives symbolically, which seemed to suggest that the pytorch grad was correct - it would have been hard to imagine that pytorch today still has a bug manifesting in such a simple case anyway. On the other hand it really baffles me if mathematically gradient descent indeed has such a weakness. Not yet exhaustively, but none of the optimizers from torch.optim worked for me either.

Did anyone know what I have encountered?