C. Exponentially weighted moving average (EWMA): s(i) = a * x(i) + (1-a) * s(i-1)
where a is a smoothing factor set to 0.1 and s(0) = 27.527
s(0) = 27.527
s(1) = 25.825
s(2) = 23.808
D. Differently from the above
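As a sanity check on option C, here is the recursion applied to the three per-batch losses quoted later in the thread (27.527, 10.503, 5.6534); a minimal sketch, variable names are my own:

```python
# EWMA of the per-batch losses with smoothing factor a = 0.1,
# using s(i) = a * x(i) + (1 - a) * s(i - 1) and s(0) = x(0).
losses = [27.527, 10.503, 5.6534]
a = 0.1

s = losses[0]        # s(0) = 27.527
history = [s]
for x in losses[1:]:
    s = a * x + (1 - a) * s
    history.append(s)

print([round(v, 3) for v in history])
```

This reproduces 25.825 and 23.808 up to rounding of the intermediate values.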
Please also comment on whether validation loss is computed differently.
I would view A as the correct way of doing it and B as a (common) approximation. If the number of batches is large and the last batch's loss isn't particularly far off, it might not matter much.
One thing to keep in mind is that during training, the loss is also changed (hopefully decreased) by the updates. This makes, in my view, some sort of moving average, such as an exponentially weighted moving average, a contender for C. Depending on how fast the decay is, the last-batch behaviour may be a bit more touchy there. In validation, on the other hand, there aren't any updates, so there is much less reason to use a moving-average scheme.
I don’t think that the loss calculation itself is inherently different for validation (it would defeat the purpose a bit, too), but note that the way the model operates (dropout, batch norm) may be different.
The other thing to keep in mind is that a small batch size may have other implications, e.g. for batch norm, adaptive optimizers like Adam, etc. If you have doubts, it may be safer to drop the last batch.
I’ve included your suggestion about the exponentially weighted moving average in the original post as option C.
Don’t you think the final EWMA value, i.e. s(2) = 23.808, is an overestimate compared with the final epoch loss obtained from the other methods (16.342 and 14.561)? Another issue could be finding an optimal value for the smoothing parameter a.
The vanilla version is not really good for very short sequences, but there is a commonly used scheme to deal with the beginning: keep a moving average of the weight/batch size as well as of the value, and then divide the two (sometimes dubbed debiasing).
While this has a lot of warts, it does serve the purpose of giving a meaningful estimate if we believe that the value changes over the course of the training.
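The debiasing scheme described above can be sketched as follows: run the same EWMA recursion on the values and on a constant weight of 1, then divide (this is the same bias correction Adam uses). Batch losses are taken from the thread; variable names are my own:

```python
# Debiased EWMA: EWMA of the losses divided by an EWMA of the constant
# weight 1.0, which removes the pull toward the arbitrary s(0) = 0 start.
losses = [27.527, 10.503, 5.6534]
a = 0.1

s = 0.0  # EWMA of the losses, deliberately started at 0
w = 0.0  # EWMA of the constant weight 1.0, same recursion
for x in losses:
    s = a * x + (1 - a) * s
    w = a * 1.0 + (1 - a) * w

debiased = s / w
print(round(debiased, 3))  # 13.802
```

Note that the debiased value (about 13.8) is much closer to the plain mean (14.561) than the raw EWMA with s(0) = 27.527 was, which also speaks to the overestimate concern raised above.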
No, the thing I had in mind for C is what is done e.g. in the optimizers, or what is discussed (not quite as clearly as I would need in order to understand it) on the moving average Wikipedia page; see the paragraph starting with It can also be calculated recursively without introducing the error ….
This isn’t the average over the entire dataset, as it weights later losses more than earlier ones.
I just want to point out that there is a small error in the B substitution. That confused me a little, so I actually checked it with a calculator. In (27.527 + 10.503 + 5.6534*2) / total_batches, the formula itself is correct, but the multiplication by two in the numerator is wrong; the stated result (14.5612) is nonetheless correct.
If the multiplication by two were correct, the result would have been 16.44.
So the correct substitution is: (27.527 + 10.503 + 5.6534) / total_batches = (27.527 + 10.503 + 5.6534) / 3 = 14.5612
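For completeness, the corrected substitution as a line of code (plain arithmetic, nothing model-specific):

```python
# Option B: epoch loss as the plain mean of the per-batch losses.
batch_losses = [27.527, 10.503, 5.6534]
total_batches = len(batch_losses)  # 3
epoch_loss = sum(batch_losses) / total_batches
print(round(epoch_loss, 3))  # 14.561
```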