For a network such as SSD, there is only one output layer, so a single loss is computed by the loss function, and loss.backward() is then used to compute the gradients of each layer, starting from the output layer. However, for a network such as RetinaNet, which uses an FPN structure, there are multiple output layers.
In https://github.com/kuangliu/pytorch-retinanet/blob/2d7c663350f330a34771a8fa6a4f37a2baa52a1d/train.py#L75, the losses from the multiple output layers are added up, and loss.backward() is called on the total loss.
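For concreteness, here is a minimal, runnable sketch of that pattern (the toy two-head model, MSE losses, and tensor shapes are my own illustration, not the repo's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a network with a shared backbone and two output
# heads (the real RetinaNet heads and losses are more involved).
torch.manual_seed(0)
backbone = nn.Linear(4, 4)
head1 = nn.Linear(4, 2)
head2 = nn.Linear(4, 2)

x = torch.randn(8, 4)
target = torch.randn(8, 2)

feat = backbone(x)
loss1 = F.mse_loss(head1(feat), target)
loss2 = F.mse_loss(head2(feat), target)

# The pattern in the linked train.py: sum first, then one backward pass.
loss = loss1 + loss2
loss.backward()
```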
So my question is: why should the losses be added up before calling loss.backward(), rather than calling loss1.backward(), loss2.backward(), … for each output layer separately?
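The alternative I am asking about would replace the last two lines of the sketch above with something like this (retain_graph=True is needed on the first call because both losses share the backbone's graph):

```python
# Alternative: one backward pass per loss. Gradients accumulate in
# param.grad, so after both calls the shared backbone parameters hold
# d(loss1)/dw + d(loss2)/dw, numerically the same as calling
# (loss1 + loss2).backward() once.
loss1.backward(retain_graph=True)  # keep the graph for the second call
loss2.backward()
```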