Calculating loss per sample vs calculating loss per dataset

I have a dataset of around 4k images, and I am iterating over it approximately 1000 times (epochs). I am wondering which is more efficient and leads to better generalization: computing the loss and doing one backpropagation step per image, or averaging all the losses from one epoch and then backpropagating once?


Averaging the loss over the whole dataset and then updating once corresponds to (full-batch) gradient descent, while updating after backpropagation on a single image (or more generally on a mini-batch of images) corresponds to stochastic gradient descent (SGD). So far, SGD is generally considered to generalize better. I advise you to look up the latest results on that topic.
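To make the difference concrete, here is a minimal sketch on a toy linear-regression problem (not your image setup; the data, learning rate, and epoch count are assumptions for illustration). Full-batch gradient descent averages the gradient over all samples and makes one update per epoch; SGD makes one update per sample.

```python
import numpy as np

# Hypothetical toy problem: fit y = w * x with squared loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

lr = 0.01

# Full-batch gradient descent: average the gradient over ALL samples,
# then do a single parameter update per epoch.
w_gd = 0.0
for epoch in range(500):
    grad = np.mean(2 * (w_gd * x - y) * x)  # d/dw of the mean squared error
    w_gd -= lr * grad

# Stochastic gradient descent: update after every single sample,
# so there are len(x) noisy updates per epoch instead of one.
w_sgd = 0.0
for epoch in range(500):
    for xi, yi in zip(x, y):
        grad = 2 * (w_sgd * xi - yi) * xi
        w_sgd -= lr * grad

print(w_gd, w_sgd)  # both approach the true slope w = 3.0
```

Both variants reach roughly the same solution here; the practical differences show up in update frequency (SGD does many more, noisier steps per epoch) and, on non-convex problems like deep networks, in which minima the noise tends to find.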

An additional thing, if you choose SGD, is to increase the batch size beyond 1 (the right size depends on your inputs, but batches are usually larger, especially if you have normalization layers such as batch norm, which need more than one sample to compute statistics). This is common practice.
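A mini-batch loop is just shuffling the dataset indices each epoch and slicing them into chunks. A sketch of that idea (the dataset size matches your 4k images; `batch_size = 32` is an assumed, tunable value, and in practice a framework utility such as a data loader would do this for you):

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 4000, 32  # 4k samples; batch size is an illustrative choice

def iter_minibatches(n, batch_size, rng):
    """Yield shuffled index arrays, one mini-batch at a time."""
    order = rng.permutation(n)  # reshuffle once per epoch
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]

# Each epoch now performs N / batch_size parameter updates
# instead of N (pure SGD) or 1 (full-batch gradient descent).
n_batches = sum(1 for _ in iter_minibatches(N, batch_size, rng))
print(n_batches)  # 4000 / 32 = 125 updates per epoch
```

With batching you would average the loss over each mini-batch before backpropagating, which gives a less noisy gradient estimate than single-sample updates while still keeping many updates per epoch.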

Hope this helps!


Thank you very much for your answer! It helped a lot :).