For labels with more than one dimension, such as images, how should the loss be computed? Should I calculate it per image and then average over the batch, or calculate the loss at the batch level?
It is computed element-wise for every sample in the batch and then averaged. If your tensor has N elements per sample, the loss is computed pixel-wise for each sample in the batch, averaged over those N elements, and then averaged over the batch dimension.
Thanks for your swift reply, but I’m waiting for more responses to see which is a better/more common practice.
The standard is to average over the batch, which is why it is coded that way by default. In any case, both approaches usually give the same result: computing L1 over each image and then averaging over the batch is equivalent to computing L1 over all the images at once, provided that every image has the same number of elements, which is the case here.
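A minimal sketch of that equivalence, assuming PyTorch (the thread does not name the framework) and a made-up batch of 4 images of shape 3x8x8. The tensor names and sizes are hypothetical and chosen only for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 4 images, 3 channels, 8x8 pixels.
pred = torch.rand(4, 3, 8, 8)
target = torch.rand(4, 3, 8, 8)

# Option A: L1 per image (mean over all non-batch dims), then mean over the batch.
per_image = F.l1_loss(pred, target, reduction="none").mean(dim=(1, 2, 3))
loss_per_image_then_batch = per_image.mean()

# Option B: L1 over every element of the batch at once (default reduction="mean").
loss_whole_batch = F.l1_loss(pred, target)

# The two values should agree up to floating-point error.
same = torch.allclose(loss_per_image_then_batch, loss_whole_batch)
print(loss_per_image_then_batch.item(), loss_whole_batch.item(), same)
```

The two options only agree because every image contributes the same number of elements; with variable-sized samples the per-image average would weight small images more heavily.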
Besides, plenty of losses cannot be computed element-wise (e.g. classification losses such as cross entropy).
Anyway, no matter how you compute it, the gradients will just be rescaled by a constant factor; they will keep the same direction in parameter space. You will just have to adjust the learning rate.
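To make the rescaling concrete, here is a rough sketch, again assuming PyTorch, that compares the gradient of a summed L1 loss with that of an averaged one. The one-layer conv model, the input sizes and the helper `weight_grad` are all hypothetical, picked only to have some parameters to differentiate:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical one-layer model and batch, chosen only for illustration.
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
x = torch.rand(4, 3, 8, 8)
target = torch.rand(4, 3, 8, 8)

def weight_grad(reduction):
    """Return the gradient of the L1 loss w.r.t. the conv weights."""
    model.zero_grad()
    loss = F.l1_loss(model(x), target, reduction=reduction)
    loss.backward()
    return model.weight.grad.clone()

grad_mean = weight_grad("mean")
grad_sum = weight_grad("sum")

n_elements = target.numel()  # 4 * 3 * 8 * 8 = 768
# Summing instead of averaging multiplies the gradient by the element count...
scaled_match = torch.allclose(grad_sum, grad_mean * n_elements, rtol=1e-4, atol=1e-3)
# ...so both gradients point in the same direction (cosine similarity close to 1).
cosine = F.cosine_similarity(grad_sum.flatten(), grad_mean.flatten(), dim=0)
print(scaled_match, cosine.item())
```

With plain SGD, dividing the learning rate by `n_elements` when switching from the averaged to the summed loss should give the same parameter update; adaptive optimizers normalize the gradient scale, so the effect there is less direct.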
Hope it helps.