You are deleting the gradients after they were computed and before the weight updates were performed.
Try moving optimizer.zero_grad() to the start of each iteration, before the forward pass, e.g.:
for epoch in range(epochs):
    for i in range(len(X)):
        optimizer.zero_grad()
        for ele in X[i]:
            output, hidden = rnn(Variable(ele.t()).cuda(), hidden.cuda())
        loss = criterion(output, Variable(Y[i]).cuda())
        loss.backward(retain_graph=True)
        optimizer.step()
optimizer.zero_grad() sets all gradients to zero, i.e. it basically clears the gradients of the Parameters that were passed to the optimizer.
You need it, because the gradients won’t be cleared otherwise and thus they will be accumulated in each iteration.
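To make the accumulation concrete, here is a minimal sketch. It assumes a recent PyTorch version (0.4 or later), where plain tensors carry requires_grad and no Variable wrapper is needed. The parameter w and the helper loss_fn are made up for illustration and are not part of the code in this thread.

```python
import torch

# Hypothetical toy parameter, just to show how .grad behaves.
w = torch.ones(3, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

def loss_fn():
    # d(loss)/dw is 2 for every element of w.
    return (2.0 * w).sum()

# First backward pass: .grad holds the gradient of this one loss.
loss_fn().backward()
first = w.grad.clone()

# Second backward pass without zero_grad(): the new gradient is
# added to the one already stored in .grad.
loss_fn().backward()
accumulated = w.grad.clone()

# Clearing the gradients first gives a fresh gradient again.
optimizer.zero_grad()
loss_fn().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```

Here first and after_reset hold the gradient of a single loss, while accumulated holds twice that value, because the second backward() call added to what was already stored.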
It’s probably not the error, but you should call .cuda on the Tensor before wrapping it in a Variable (for example, in this case, Variable(Y[i].cuda()) instead of Variable(Y[i]).cuda()).
Could you check the gradients with rnn.i2h.weight.grad?
Also, could you provide the shapes of X and Y?
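If checking a single weight by hand is tedious, a small helper along these lines could list every parameter whose gradient contains nan. This is only a sketch: the nn.Linear model below is a hypothetical stand-in for rnn, the helper name params_with_nan_grad is made up, and it assumes a PyTorch version that provides torch.isnan.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the rnn module discussed in this thread.
model = nn.Linear(4, 2)
x = torch.randn(3, 4)
loss = model(x).sum()
loss.backward()

def params_with_nan_grad(module):
    """Return the names of parameters whose .grad contains at least one nan."""
    bad = []
    for name, param in module.named_parameters():
        # Parameters that did not take part in the backward pass have grad None.
        if param.grad is not None and torch.isnan(param.grad).any():
            bad.append(name)
    return bad

nan_params = params_with_nan_grad(model)
print(nan_params)
```

Calling params_with_nan_grad(rnn) right after loss.backward() and before optimizer.step() would show which layers are affected in the first iteration where nan appears.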
print(rnn.i2h.weight.grad) gives me a 90x90 matrix consisting entirely of nan values. Also, each element of X, i.e. X[i], is a vector of length 90. Y is also a vector of length 90.