I am working on a visual model with multiple outputs and thus multiple losses. I was under the impression that I could simply add the losses together and backpropagate over the aggregate. This school of thought seems quite common throughout the forums, for example here and here.
But I came across this StackOverflow thread that says there is an advantage to keeping the two losses separate if you use two different optimizers, for example separate Adam optimizers that each optimize a different output pathway. In that thread they say:
“Let’s say loss 1 varies rapidly with your parameters but is small in magnitude. You’d need small steps to optimize it, because it’s not smooth. And loss 2 varies slowly, but is big in magnitude. #2 will dominate their sum, so one shared ADAM will choose a big learning rate. But if you keep them separate, ADAM will chose a big learning rate for loss #2 and a small learning rate for loss #1”
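To make sure I understand the scenario they describe, here is a toy example of that kind of loss pair (both functions are made up purely to produce that behaviour):

import torch

w = torch.tensor(1.0, requires_grad=True)

# loss1: tiny in magnitude but wiggles rapidly as w changes
loss1 = 0.01 * torch.sin(100 * w)
# loss2: large in magnitude but changes smoothly
loss2 = 10.0 * w ** 2

(loss1 + loss2).backward()
print(loss1.item(), loss2.item())  # roughly -0.005 vs 10.0
print(w.grad.item())               # roughly 20.9, almost entirely loss2's 20.0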
I’m wondering how this would look in practice. Would this work?
import torch.optim as optim  # for optim.Adam

# note: both optimizers hold the same parameter set here
# (Adam has no momentum argument; its betas cover that, so it is left out)
optimizer1 = optim.Adam(model.parameters(), lr=lr)
optimizer2 = optim.Adam(model.parameters(), lr=lr)

for data, target in train_loader:
    optimizer1.zero_grad()
    optimizer2.zero_grad()

    data = data.to(device)
    target = target.to(device)

    output = model(data)
    loss1 = loss_fn(output, target)
    loss2 = loss_fn2(output, target)

    # the first backward keeps the graph so the second backward can reuse it;
    # both calls accumulate into the same .grad buffers
    loss1.backward(retain_graph=True)
    loss2.backward()

    optimizer1.step()
    optimizer2.step()
retain_graph=True seems to be needed on the first backward call, based on that thread, so that the graph isn't freed before the second backward runs.
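My reading of that thread is also that each optimizer should own only its own pathway's parameters rather than all of model.parameters(), something like the following (head1 and head2 are just placeholder names for the output sub-modules; how the shared trunk is handled is a separate question):

# head1 / head2 are placeholder names; each Adam then keeps separate statistics
optimizer1 = optim.Adam(model.head1.parameters(), lr=lr)
optimizer2 = optim.Adam(model.head2.parameters(), lr=lr)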
This seems advantageous to me, but does anyone see any compute problems with it? Specifically, what if I scaled up and had many different output pathways and optimizers, say 100? Would this still be feasible?
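For concreteness, the scaled-up version I have in mind would look roughly like this, where heads and loss_fns are placeholders for however the 100 pathways and their losses are actually defined:

# hypothetical scaled-up loop: one loss and one Adam per output head
optimizers = [optim.Adam(head.parameters(), lr=lr) for head in heads]

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    outputs = model(data)  # assuming the model returns one output per head

    for opt in optimizers:
        opt.zero_grad()

    losses = [fn(out, target) for fn, out in zip(loss_fns, outputs)]
    for i, loss in enumerate(losses):
        # keep the graph alive for every backward call except the last
        loss.backward(retain_graph=(i < len(losses) - 1))

    for opt in optimizers:
        opt.step()

Even written like this, it is one backward pass per loss per batch, which is the part I'm unsure about compute-wise.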