I still cannot reproduce the issue, so it would be great if you could post a minimal, executable code snippet that shows it.
In the code below you can see that the output shape changes in the second forward call:
import torch
import torch.nn as nn

# LSTM is the custom module posted above
l = LSTM(150, 1, 1, 150)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(l.parameters(), lr=0.01)

x = torch.randn(10, 10, 150)
y_output = l(x)
print(y_output.shape)
# torch.Size([1, 150])
y_output = l(x)
print(y_output.shape)
# torch.Size([10, 10, 150])
which seems at least uncommon.
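Such a shape change could happen e.g. if the module caches some state in its forward method and takes a different code path in the next call. Here is a minimal, hypothetical sketch (StatefulModule is made up for illustration and is not your LSTM class) that shows the same pattern:

import torch
import torch.nn as nn

class StatefulModule(nn.Module):
    # hypothetical module, not the LSTM implementation from this thread
    def __init__(self, hidden_size=150):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.hidden = None  # cached state

    def forward(self, x):
        if self.hidden is None:
            # first call: reduce over batch and sequence dims -> [1, hidden_size]
            self.hidden = self.linear(x).mean(dim=(0, 1)).unsqueeze(0)
            return self.hidden
        # later calls: state exists, so the full [batch, seq, hidden_size] tensor is returned
        return self.linear(x)

m = StatefulModule()
x = torch.randn(10, 10, 150)
print(m(x).shape)  # torch.Size([1, 150])
print(m(x).shape)  # torch.Size([10, 10, 150])

If something along these lines is happening in your implementation, the two outputs would not even have the same meaning, so I would double check the forward method.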
The rest of the code also shows valid gradients:
y_true = torch.randn_like(y_output)

loss = criterion(l(x), y_true)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print("Wfh.grad:")
print(l.Wfh.grad)
# tensor([[0.0018]])

for name, parms in l.named_parameters():
    print('-->name:', name, '-->grad_requirs:', parms.requires_grad,
          ' -->grad_value:', parms.grad.abs().sum(), '-->value:', parms.abs().sum())
# -->name: Wfh -->grad_requirs: True -->grad_value: tensor(0.0001) -->value: tensor(0.8921, grad_fn=<SumBackward0>)
# -->name: Wfx -->grad_requirs: True -->grad_value: tensor(0.0883) -->value: tensor(121.5771, grad_fn=<SumBackward0>)
# -->name: bf -->grad_requirs: True -->grad_value: tensor(0.0015) -->value: tensor(1.7418, grad_fn=<SumBackward0>)
# -->name: Wih -->grad_requirs: True -->grad_value: tensor(1.4800e-05) -->value: tensor(0.3881, grad_fn=<SumBackward0>)
# -->name: Wix -->grad_requirs: True -->grad_value: tensor(0.5420) -->value: tensor(120.5190, grad_fn=<SumBackward0>)
# -->name: bi -->grad_requirs: True -->grad_value: tensor(0.0131) -->value: tensor(0.3184, grad_fn=<SumBackward0>)
# -->name: Woh -->grad_requirs: True -->grad_value: tensor(0.0020) -->value: tensor(0.9126, grad_fn=<SumBackward0>)
# -->name: Wox -->grad_requirs: True -->grad_value: tensor(0.6173) -->value: tensor(136.6227, grad_fn=<SumBackward0>)
# -->name: bo -->grad_requirs: True -->grad_value: tensor(0.0181) -->value: tensor(1.3139, grad_fn=<SumBackward0>)
# -->name: Wch -->grad_requirs: True -->grad_value: tensor(0.0004) -->value: tensor(1.6613, grad_fn=<SumBackward0>)
# -->name: Wcx -->grad_requirs: True -->grad_value: tensor(0.7321) -->value: tensor(134.0075, grad_fn=<SumBackward0>)
# -->name: bc -->grad_requirs: True -->grad_value: tensor(0.0076) -->value: tensor(1.8784, grad_fn=<SumBackward0>)
# -->name: Wy -->grad_requirs: True -->grad_value: tensor(0.2735) -->value: tensor(103.7549, grad_fn=<SumBackward0>)
# -->name: by -->grad_requirs: True -->grad_value: tensor(1.5557) -->value: tensor(113.4924, grad_fn=<SumBackward0>)
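If you want an additional check after loss.backward(), a quick sketch like this (reusing l from above) would also catch parameters that did not receive any gradient at all, e.g. because they were frozen or unused in the forward pass:

# sanity check: every parameter should have a populated .grad after backward()
for name, p in l.named_parameters():
    if p.grad is None:
        print(name, "did not receive a gradient!")
    else:
        print(name, "grad norm:", p.grad.norm().item())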