Hi,

In this line of the word_language_model example, the validation loss is multiplied by `len(data)`:

```
total_loss += len(data) * criterion(output, targets).item()
return total_loss / (len(data_source) - 1)
```

Two questions:

- Why is the loss multiplied by `len(data)`, i.e. the number of tokens in the batch? Especially considering that the training loss is not multiplied by this factor.
- Why is the total_loss eventually divided by `len(data_source) - 1` and not just `len(data_source)`? Why is `-1` necessary here since we aim to get the # of batches in total?

Any help will be greatly appreciated. And thanks for these examples in the first place, they have really made the transition to PyTorch smoother.

Best,

The reasons for these two are interconnected.

If you look at the for loop, `for i in range(0, data_source.size(0) - 1, args.bptt):`, the start index `i` steps through the first `length - 1` rows in chunks of `args.bptt`. The `- 1` is there because each target is the token one position after its input, so the last row has no following token to predict.

Think of a situation where one batch is NOT full. So, for `args.bptt = 32`, say the last batch has just 17 time steps. Since the criterion returns a mean over each batch, multiplying every batch's loss by its length weights the batches correctly: all the other batches are multiplied by 32 and the last one by 17. Finally, we divide the total loss by `len(data_source) - 1`, which is the sum of all those lengths, to get the average loss per time step. Note that the training loss does none of this.
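To make the arithmetic concrete, here is a minimal sketch in plain Python (no PyTorch needed). The helper names `chunk_lengths` and `weighted_mean`, the sizes, and the loss values are all made up for illustration; the only assumption taken from the example is that each chunk has at most `bptt` rows and that the last row is never used as an input.

```python
def chunk_lengths(num_rows, bptt):
    """Lengths of the chunks visited by range(0, num_rows - 1, bptt).

    Hypothetical helper: every chunk has bptt rows except possibly the last.
    """
    return [min(bptt, num_rows - 1 - i) for i in range(0, num_rows - 1, bptt)]


def weighted_mean(losses, lengths):
    """Average per-chunk mean losses, weighting each one by its chunk length."""
    total = sum(n * loss for loss, n in zip(losses, lengths))
    return total / sum(lengths)


num_rows, bptt = 82, 32                  # made-up sizes
lengths = chunk_lengths(num_rows, bptt)  # [32, 32, 17]
losses = [5.0, 5.0, 8.0]                 # made-up per-chunk mean losses

weighted = weighted_mean(losses, lengths)  # divides by 81 == num_rows - 1
unweighted = sum(losses) / len(losses)     # treats the short chunk like a full one

print(lengths, sum(lengths), weighted, unweighted)
```

With these numbers the lengths add up to `num_rows - 1`, which is why that is the divisor, and the plain average of the three losses overstates the contribution of the short last chunk compared with the length-weighted one.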

Thanks for the detailed reply Abhilash.

Your example makes sense. Then, a follow-up question: why isn't the training loss multiplied by `len(data)` while the validation loss is?

Is there some relevant theoretical background underlying word language models that I’m missing?

There’s no underlying theoretical background that you are missing. You can do the same for the training loss and get similar results.