I’m trying to understand how the losses from multiple loss functions within a network are applied to the network. For example, if I have a network with 3 stages and a loss is calculated at the end of each stage (using the same loss function), how is that then applied in .backward()?
...
model = MyModel()
loss1, loss2, loss3 = model(inputs)
total_loss = loss1 + loss2 + loss3
optimizer.zero_grad()
total_loss.backward()
...
Is total_loss used to calculate the grads across the whole network (all 3 stages), or does loss1 get used to calculate the grads for stage 1, loss2 for stage 2 AND stage 1, and loss3 for stage 3, stage 2 AND stage 1? Or something else?
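A small experiment along these lines could be used to check the question empirically. It is only a sketch under stated assumptions: ThreeStage is a hypothetical stand-in for MyModel in which the stages are chained with no detach between them, and the layer sizes, the MSE loss, and the random data are made up for illustration. It records which parameters each individual loss produces a gradient for, and compares the sum of the per-loss gradients with the gradient of the summed loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ThreeStage(nn.Module):
    """Hypothetical stand-in for MyModel: three chained stages, one loss per stage."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(4, 4)
        self.stage2 = nn.Linear(4, 4)
        self.stage3 = nn.Linear(4, 4)
        self.criterion = nn.MSELoss()  # same loss function at every stage

    def forward(self, x, target):
        out1 = self.stage1(x)
        out2 = self.stage2(out1)  # stage 2 consumes stage 1's output (assumption)
        out3 = self.stage3(out2)  # stage 3 consumes stage 2's output (assumption)
        return (
            self.criterion(out1, target),
            self.criterion(out2, target),
            self.criterion(out3, target),
        )


model = ThreeStage()
inputs, target = torch.randn(8, 4), torch.randn(8, 4)
names = [name for name, _ in model.named_parameters()]
params = list(model.parameters())


def grads_of(loss):
    # Gradient of one scalar loss w.r.t. every parameter.
    # An entry is None when the loss does not depend on that parameter.
    return torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)


loss1, loss2, loss3 = model(inputs, target)
per_loss = [grads_of(loss) for loss in (loss1, loss2, loss3)]
total = grads_of(loss1 + loss2 + loss3)

# For each parameter: does [loss1, loss2, loss3] produce a gradient for it?
reaches = {n: [g[i] is not None for g in per_loss] for i, n in enumerate(names)}
for n in names:
    print(n, reaches[n])
# stage1.* -> [True, True, True]
# stage2.* -> [False, True, True]
# stage3.* -> [False, False, True]

# The gradient of the summed loss matches the sum of the per-loss gradients.
matches = all(
    torch.allclose(sum(g[i] for g in per_loss if g[i] is not None), total[i], atol=1e-5)
    for i in range(len(params))
)
print(matches)
```

In this sketch, stage 1 receives gradient from all three losses and stage 3 only from loss3, and the gradient of the summed loss is the sum of the per-loss gradients. If a real model detaches the tensors passed between stages, the pattern would differ.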
Thanks, I hope that makes sense. Some examples of this staged approach in the wild are the hourglass human pose paper and the convolutional pose machines paper.
Thanks,
Luke