Simplest LSTM possible and it does not work

I am trying to build the simplest possible LSTM to predict a sequence. It loads a sequence into the LSTM module and then passes each item of the output through a linear layer. I am training it 1000 times on the same sequence (0, 0.05, …, 1), which should become (0.1, 0.15, …, 1.1).

It does not really work. After 1000 iterations the loss is reduced a bit and the output resembles the right one in some ways. But it works really badly for having been trained so much on a single, simple sequence.

I’d really appreciate any help in understanding why it performs so badly.

Here is the code:

import torch
import numpy as np

# LSTM with one input feature and 100 hidden units, followed by a
# linear layer that maps each hidden state to a single output value
lstm = torch.nn.LSTM(input_size=1, hidden_size=100, num_layers=1, batch_first=True).to(torch.double)
linear = torch.nn.Linear(100, 1).to(torch.double)
criterion = torch.nn.MSELoss()
params = list(lstm.parameters()) + list(linear.parameters())
optimizer = torch.optim.Adam(params, 1e-3)

L = 21

# input ramp (0, 0.05, ..., 1.0), shaped (batch=1, seq_len=L, features=1)
x = torch.tensor(np.arange(L) / 20, dtype=torch.double).unsqueeze(0).unsqueeze(2)
# target: the same ramp shifted two steps ahead, (0.1, 0.15, ..., 1.1)
y = torch.tensor(np.arange(2, L + 2) / 20, dtype=torch.double)

for j in range(1000):
    # fresh zero hidden and cell states each iteration
    hidden = (torch.zeros((1, 1, 100), dtype=torch.double),
              torch.zeros((1, 1, 100), dtype=torch.double))
    output, hidden = lstm(x, hidden)
    # accumulate the MSE over every timestep of the sequence
    loss = 0.0
    for i in range(L):
        o = linear(output[:, i, :].squeeze())
        loss += criterion(o, y.squeeze()[i])
    print(loss)
    loss.backward()
    optimizer.step()

# build the predicted sequence from the last forward pass
out_seq = torch.empty(L)
for i in range(L):
    out_seq[i] = linear(output[:, i, :].squeeze())

print("output (expected): ", y)
print("output:", out_seq)

Output:
The loss falls and rises, falls and rises again, but the overall trend is decreasing:

output (expected):  tensor([0.1000, 0.1500, 0.2000, 0.2500, 0.3000, 0.3500, 0.4000, 0.4500, 0.5000,
        0.5500, 0.6000, 0.6500, 0.7000, 0.7500, 0.8000, 0.8500, 0.9000, 0.9500,
        1.0000, 1.0500, 1.1000], dtype=torch.float64)
output: tensor([ 0.1231, -0.2927, -0.2559, -0.0443,  0.0863,  0.1924,  0.2867,  0.3715,
         0.4473,  0.5152,  0.5764,  0.6319,  0.6827,  0.7296,  0.7734,  0.8146,
         0.8540,  0.8920,  0.9289,  0.9650,  1.0007], grad_fn=<CopySlices>)

What happens if you decrease the optimizer learning rate?

It improves, but it is still outrageously bad. I tried lr=1e-5 and 10000 epochs, and the loss decreases further.

But I would be surprised if it were normal to train this much on a single sample and still obtain bad, if slightly better, results.

Hi Tom,

The problem is that the accumulated gradients are not reset before each iteration, so the optimizer updates the weights based not only on the gradients from the current iteration but also on those from all previous iterations. As explained in the torch.optim docs, you need to call optimizer.zero_grad() to reset the gradients at each step.
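To see the accumulation in isolation, here is a minimal standalone sketch (separate from your model): every backward() call adds into the .grad buffers, so without zeroing them the stored gradient keeps growing.

import torch

# a single parameter with gradient tracking
w = torch.ones(1, requires_grad=True)

(w * 2).sum().backward()
print(w.grad)  # tensor([2.])

# a second backward() without zeroing adds to the existing .grad
(w * 2).sum().backward()
print(w.grad)  # tensor([4.])

# after zeroing, the next backward() starts from a clean slate
w.grad.zero_()
(w * 2).sum().backward()
print(w.grad)  # tensor([2.])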

With this diff:

 y = torch.tensor(np.arange(2, L + 2) / 20, dtype=torch.double)
 
 for j in range(1000):
+    optimizer.zero_grad()
     # fresh zero hidden and cell states each iteration
     hidden = (torch.zeros((1, 1, 100), dtype=torch.double),
               torch.zeros((1, 1, 100), dtype=torch.double))

I get a continuously decreasing loss and this final output:

tensor(0.0001, dtype=torch.float64, grad_fn=<AddBackward0>)
output (expected):  tensor([0.1000, 0.1500, 0.2000, 0.2500, 0.3000, 0.3500, 0.4000, 0.4500, 0.5000,
        0.5500, 0.6000, 0.6500, 0.7000, 0.7500, 0.8000, 0.8500, 0.9000, 0.9500,
        1.0000, 1.0500, 1.1000], dtype=torch.float64)
output: tensor([0.0925, 0.1473, 0.2005, 0.2522, 0.3025, 0.3522, 0.4014, 0.4504, 0.4996,
        0.5489, 0.5985, 0.6484, 0.6986, 0.7492, 0.7999, 0.8506, 0.9013, 0.9516,
        1.0013, 1.0502, 1.0981], grad_fn=<CopySlices>)
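
As a side note, not needed for the fix: nn.Linear is applied to the last dimension of its input, so the per-timestep loop can be replaced with a single call over the whole LSTM output. A sketch under the same shapes as above; note that MSELoss averages by default, so this loss is the summed per-step version divided by L.

# output has shape (1, L, 100); one call maps every timestep's
# hidden state to a prediction, giving shape (1, L, 1)
o = linear(output).squeeze()   # (L,)
loss = criterion(o, y)         # mean squared error over all timesteps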

Thanks Robert, that was the problem

Thanks Robert!
I just withdrew my post