I’m confused about how PyTorch works in a batch setting. For the longest time, I had assumed that you compute the individual losses for each sample in a batch, average them into a single scalar loss, compute the gradients with respect to that loss, and use those to update your parameters.
However, I’ve recently read some things that make me unsure of my assumption. What I’ve read is that you compute the loss for each individual sample and, for each sample, compute separate gradients. So if your batch size is 128, you essentially compute a set of 128 gradients for each parameter, and before updating the parameter, you average/sum them.
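To make the question concrete, here is a minimal sketch (a hypothetical toy linear model with `MSELoss`) comparing the two procedures I described. By linearity of differentiation, the gradient of the mean loss equals the mean of the per-sample gradients, so both should agree numerically:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x = torch.randn(128, 4)
y = torch.randn(128, 1)
loss_fn = torch.nn.MSELoss(reduction="mean")

# View 1: average the per-sample losses into one scalar, backprop once.
model.zero_grad()
loss_fn(model(x), y).backward()
grad_batched = model.weight.grad.clone()

# View 2: backprop each sample separately, then average the 128 gradients.
grads = []
for i in range(128):
    model.zero_grad()
    loss_fn(model(x[i:i+1]), y[i:i+1]).backward()
    grads.append(model.weight.grad.clone())
grad_per_sample = torch.stack(grads).mean(dim=0)

# Both views produce the same gradient (up to float tolerance).
print(torch.allclose(grad_batched, grad_per_sample, atol=1e-6))
```

So mathematically the two descriptions give the same update; my question is which one PyTorch actually performs internally.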
Which of these is closer to what actually happens in PyTorch?