Looking at the tutorial here, we see:
"The magic of autograd allows you to simply sum these losses at each step and call backward at the end."
    loss = 0
    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l
    loss.backward()
So the per-step losses are simply summed, and a single backward pass then computes the gradients of that summed loss. My question is: why, for an RNN, do we not average the loss over all predicted tokens? Taking the mean rather than the sum seems more natural to me, since it would give the average error per token of the input sequence.
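
To make the comparison concrete, here is a minimal sketch of the two options on a toy problem. It is not the tutorial's model: the scalar "RNN" h_t = w * (h_{t-1} + x_t), the squared-error loss, the data, and all names (step_losses, grad, summed_loss, mean_loss) are assumptions made up for illustration, and the gradient is estimated by finite differences rather than autograd. It only checks how the gradient of the summed loss relates to the gradient of the averaged loss for one sequence of length T.

```python
# Toy stand-in for the tutorial loop (hypothetical names, not the tutorial's code):
# a single scalar weight w, a scalar hidden state, squared-error loss per step.

def step_losses(w, inputs, targets):
    """Return the per-step losses of the toy recurrence h_t = w * (h_{t-1} + x_t)."""
    h = 0.0
    losses = []
    for x, y in zip(inputs, targets):
        h = w * (h + x)              # hidden state carried across steps
        losses.append((h - y) ** 2)  # loss at this step
    return losses

def grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw (stand-in for backward())."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

# Made-up sequence of length T = 4.
inputs = [0.5, -1.0, 2.0, 0.3]
targets = [0.2, 0.1, -0.4, 1.0]
T = len(inputs)

def summed_loss(w):
    # What the tutorial does: add up the loss of every step.
    return sum(step_losses(w, inputs, targets))

def mean_loss(w):
    # The alternative I am asking about: average over the steps.
    return summed_loss(w) / T

w = 0.7
grad_sum = grad(summed_loss, w)
grad_mean = grad(mean_loss, w)

# The two gradients should differ only by the constant factor T.
print(grad_sum, grad_mean * T)
```

In this toy setting the averaged loss gives the summed-loss gradient scaled by 1/T, so for a fixed sequence length the difference looks like a rescaling of the step size. Whether that still holds in a meaningful way when sequence lengths vary from sample to sample is part of what I am unsure about.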