I am working on a visual model with multiple outputs and thus multiple losses. I was under the impression that I could simply add the losses together and backpropagate over the aggregate. This school of thought seems quite common throughout the forums, for example here and here.
But I came across this StackOverflow thread that says there is an advantage to keeping the two losses separate if you use two different optimizers, for example separate Adam optimizers that each optimize a different output pathway. In that thread they say:
“Let’s say loss 1 varies rapidly with your parameters but is small in magnitude. You’d need small steps to optimize it, because it’s not smooth. And loss 2 varies slowly, but is big in magnitude. #2 will dominate their sum, so one shared ADAM will choose a big learning rate. But if you keep them separate, ADAM will chose a big learning rate for loss #2 and a small learning rate for loss #1”
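To make sure I understand the scenario they describe, here is a toy example of that kind of loss pair (both functions are made up purely to produce that behaviour):

import torch

w = torch.tensor(1.0, requires_grad=True)

# loss1: tiny in magnitude but wiggles rapidly as w changes
loss1 = 0.01 * torch.sin(100 * w)
# loss2: large in magnitude but changes smoothly
loss2 = 10.0 * w ** 2

(loss1 + loss2).backward()
print(loss1.item(), loss2.item())  # roughly -0.005 vs 10.0
print(w.grad.item())               # roughly 20.9, almost entirely loss2's 20.0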
I’m wondering how this would look in practice. Would this work?
import torch.optim as optim  # for optim.Adam

# note: both optimizers hold the same parameter set here
# (Adam has no momentum argument; its betas cover that, so it is left out)
optimizer1 = optim.Adam(model.parameters(), lr=lr)
optimizer2 = optim.Adam(model.parameters(), lr=lr)

for data, target in train_loader:
    optimizer1.zero_grad()
    optimizer2.zero_grad()

    data = data.to(device)
    target = target.to(device)

    output = model(data)
    loss1 = loss_fn(output, target)
    loss2 = loss_fn2(output, target)

    # the first backward keeps the graph so the second backward can reuse it;
    # both calls accumulate into the same .grad buffers
    loss1.backward(retain_graph=True)
    loss2.backward()

    optimizer1.step()
    optimizer2.step()
retain_graph=True seems to be needed on the first backward call, based on that thread, so that the graph isn't freed before the second backward runs.
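My reading of that thread is also that each optimizer should own only its own pathway's parameters rather than all of model.parameters(), something like the following (head1 and head2 are just placeholder names for the output sub-modules; how the shared trunk is handled is a separate question):

# head1 / head2 are placeholder names; each Adam then keeps separate statistics
optimizer1 = optim.Adam(model.head1.parameters(), lr=lr)
optimizer2 = optim.Adam(model.head2.parameters(), lr=lr)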
This seems advantageous to me, but does anyone see any compute problems with it? Specifically, what if I scaled up and had many different output pathways and optimizers, say 100? Would this still be feasible?
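For concreteness, the scaled-up version I have in mind would look roughly like this, where heads and loss_fns are placeholders for however the 100 pathways and their losses are actually defined:

# hypothetical scaled-up loop: one loss and one Adam per output head
optimizers = [optim.Adam(head.parameters(), lr=lr) for head in heads]

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    outputs = model(data)  # assuming the model returns one output per head

    for opt in optimizers:
        opt.zero_grad()

    losses = [fn(out, target) for fn, out in zip(loss_fns, outputs)]
    for i, loss in enumerate(losses):
        # keep the graph alive for every backward call except the last
        loss.backward(retain_graph=(i < len(losses) - 1))

    for opt in optimizers:
        opt.step()

Even written like this, it is one backward pass per loss per batch, which is the part I'm unsure about compute-wise.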