In an RNN, why don't we average the loss before backward?

Looking at the tutorial here, we see:

The magic of autograd allows you to simply sum these losses at each step and call backward at the end.

    loss = 0

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

So we just sum the per-step losses and then run a single backward pass on that total. My question is why, for an RNN, we do not average the loss over all predicted tokens. Taking the average rather than the sum seems more natural to me, since it gives the mean error over the input sequence.

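I don’t think it would affect the gradients in any meaningful way: averaging only rescales them by a constant factor of 1/T (with T the sequence length), and their direction is still determined by the l = criterion(...) terms. Summing the losses is simply convenient, since we then need only a single loss.backward() call.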
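
To make the 1/T scaling concrete, here is a minimal framework-free sketch. It is not the tutorial's model: the scalar RNN `h_t = tanh(w*x_t + u*h_{t-1})`, the squared-error loss, the toy data, and the names `step_losses` and `grad_w` are all assumptions made for illustration, and the gradient is estimated with central finite differences rather than autograd.

```python
import math

def step_losses(w, u, xs, ys):
    """Per-step squared errors of a toy scalar RNN (hypothetical model)."""
    h, losses = 0.0, []
    for x, y in zip(xs, ys):
        h = math.tanh(w * x + u * h)   # recurrent update
        losses.append((h - y) ** 2)    # loss at this time step
    return losses

def grad_w(loss_fn, w, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dw."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

# Toy sequence and weights (made-up values).
xs = [0.5, -1.0, 0.3, 0.8]
ys = [0.1, -0.2, 0.4, 0.0]
u, w0 = 0.7, 0.9
T = len(xs)

def sum_loss(w):
    return sum(step_losses(w, u, xs, ys))

def mean_loss(w):
    return sum(step_losses(w, u, xs, ys)) / T

g_sum = grad_w(sum_loss, w0)
g_mean = grad_w(mean_loss, w0)

# The two gradients agree once the mean-based one is multiplied by T.
print(g_sum, g_mean * T)
```

With plain SGD, switching from sum to mean is therefore equivalent to dividing the learning rate by T for that sequence. This matters mainly when sequences have different lengths, because the sum weights long sequences more heavily than the mean does.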

If I understand correctly, we could in principle do something like this:

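    losses = []

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        losses.append(l)

    # The per-step losses share one graph through the hidden state, so keep it
    # alive between calls; gradients accumulate in each parameter's .grad.
    for l in losses:
        l.backward(retain_graph=True)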
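
The reason separate backward calls can stand in for one call on the summed loss is that differentiation is linear, and PyTorch accumulates gradients into `.grad` across calls. Below is a small sketch of that linearity without PyTorch. The scalar RNN, the squared-error loss, the toy data, and the helper names `step_losses` and `finite_diff` are assumptions for illustration only, and finite differences are used in place of autograd.

```python
import math

def step_losses(w, xs, ys, u=0.7):
    """Per-step squared errors of a toy scalar RNN (hypothetical model)."""
    h, losses = 0.0, []
    for x, y in zip(xs, ys):
        h = math.tanh(w * x + u * h)   # hidden state carries over between steps
        losses.append((h - y) ** 2)
    return losses

def finite_diff(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

# Toy sequence and weight (made-up values).
xs = [0.5, -1.0, 0.3, 0.8]
ys = [0.1, -0.2, 0.4, 0.0]
w0 = 0.9

# One "backward pass" on the summed loss.
grad_of_sum = finite_diff(lambda w: sum(step_losses(w, xs, ys)), w0)

# One "backward pass" per step, accumulated the way .grad would be.
accumulated = 0.0
for i in range(len(xs)):
    accumulated += finite_diff(lambda w, i=i: step_losses(w, xs, ys)[i], w0)

# Both routes give the same gradient up to rounding error.
print(grad_of_sum, accumulated)
```

In practice the single call on the summed loss is usually preferable, since it traverses the shared graph once instead of once per time step.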