I see that in most PyTorch implementations, it is common to compute the output of a mini-batch by running it through the model and then computing and normalizing the loss. This implies that the loss on which backward() is called is the arithmetic mean of the per-sample losses produced by the mini-batch at the output layer.
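To make the pattern I mean concrete, here is a minimal sketch (toy model and data are hypothetical, just stand-ins for something like the ImageNet example): the criterion's default reduction averages the per-sample losses into a single scalar before backward() is called.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 3)            # toy model, stands in for a real network
criterion = nn.CrossEntropyLoss()   # default reduction averages over the batch

inputs = torch.randn(4, 10)         # mini-batch of 4 samples
targets = torch.tensor([0, 2, 1, 0])

outputs = model(inputs)
loss = criterion(outputs, targets)  # scalar: mean of the 4 per-sample losses
loss.backward()                     # gradients w.r.t. this averaged scalar
```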
My question is: does this averaging also take place at every layer? Is the loss at a given layer normalized, and that same normalized loss then used for all activations of that layer when computing the gradients for back-propagation? A reference to the modules in PyTorch that actually do this when backward() is called would be really helpful.
This applies to most implementations, but as an example you can check out the PyTorch ImageNet implementation. Upon digging further I found that we can pass reduce=False (reduction='none' in newer versions) to the loss function, which prevents the loss from being averaged.
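A minimal sketch of the contrast (toy logits, no model needed): the default reduction collapses the batch to one scalar, while reduction='none' (the current spelling of the older reduce=False) keeps the individual per-sample losses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)           # toy mini-batch: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])

# Default: per-sample losses are averaged into a single scalar.
mean_loss = nn.CrossEntropyLoss()(logits, targets)

# reduction='none' returns the individual per-sample losses instead.
per_sample = nn.CrossEntropyLoss(reduction='none')(logits, targets)

print(mean_loss.shape)   # a 0-dim scalar
print(per_sample.shape)  # one loss per sample
```

Averaging the per-sample vector recovers the default scalar, which is what suggests the mean is taken once at the output rather than layer by layer.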
However, the question remains whether the same averaging happens by default in the hidden layers during back-propagation.