Questions about loss averaging in seq2seq tutorial

In the great seq2seq tutorial there’s a thing I don’t get. The loss is summed over all the decoder steps, each of which is computed independently inside the for loop. After the summation, backward() and step() are called, and only then is the loss divided by the number of decoder steps, for the visualization.
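To make the pattern concrete, here is a minimal sketch of how I read that loop. It is not the tutorial's code: the GRU cell, the layer sizes, the random targets and the names `cell`, `out` and `reported_loss` are all hypothetical stand-ins, assumed only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the tutorial's decoder (assumed, not the real model).
vocab_size, hidden_size, target_length = 10, 8, 5
cell = nn.GRUCell(hidden_size, hidden_size)
out = nn.Linear(hidden_size, vocab_size)
params = list(cell.parameters()) + list(out.parameters())

optimizer = torch.optim.SGD(params, lr=0.01)
criterion = nn.CrossEntropyLoss()

decoder_input = torch.zeros(1, hidden_size)
hidden = torch.zeros(1, hidden_size)
target = torch.randint(0, vocab_size, (target_length,))  # random toy targets

optimizer.zero_grad()
loss = 0
for t in range(target_length):
    hidden = cell(decoder_input, hidden)
    logits = out(hidden)
    # Each step adds its own loss term; the autograd graph keeps every term.
    loss = loss + criterion(logits, target[t].unsqueeze(0))
    decoder_input = hidden.detach()  # placeholder for "feed the next input"

loss.backward()   # gradients of the summed loss
optimizer.step()  # parameter update uses the summed loss

# The division happens only afterwards, so it affects the reported number only.
reported_loss = loss.item() / target_length
```

In this sketch the division by `target_length` comes after `step()`, so it cannot influence the update itself.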

  1. The loss is a single number, so is the same error value propagated back through all the input nodes, or does it somehow remember which node contributed which error?
  2. What happens if we average the loss before calling backward() and step()? Is that only a rescaling, so that we would need a larger learning rate, or not?
  3. Isn’t it a problem for the optimizer that long sentences produce a large loss (a sum over many steps), while short ones produce a small loss?
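For question 2, a small check along these lines might help frame the discussion. It is only a sketch under assumptions: the linear model, the random data and the helper `grad_of` are made up for illustration and are not from the tutorial. It compares the gradient of the summed loss with the gradient of the averaged loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T = 4  # number of "decoder steps" in this toy setup (assumed)
model = nn.Linear(3, 5)  # hypothetical stand-in for the decoder
inputs = torch.randn(T, 3)
targets = torch.randint(0, 5, (T,))
criterion = nn.CrossEntropyLoss()


def grad_of(scale):
    """Return the weight gradient of (scale * sum of per-step losses)."""
    model.zero_grad()
    loss = sum(
        criterion(model(inputs[t:t + 1]), targets[t:t + 1]) for t in range(T)
    )
    (loss * scale).backward()
    return model.weight.grad.clone()


grad_sum = grad_of(1.0)       # loss summed over steps
grad_mean = grad_of(1.0 / T)  # loss averaged over steps

# The averaged-loss gradient should be the summed-loss gradient divided by T.
same_up_to_scale = torch.allclose(grad_sum, grad_mean * T, atol=1e-6)
print(same_up_to_scale)
```

If the check holds, then for plain SGD averaging would behave like dividing the learning rate by the number of steps. Whether that matters in the same way for other optimizers, and for sentences of different lengths, is part of what I am asking.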