How to train two models simultaneously?

ageryw · March 28, 2022, 11:43am

I have to train two models sequentially, where the loss of one model will somehow be used by the other model. Part of the code is as follows:

  # model1
  pred1 = model1(data)
  loss1 = loss_fn1(pred1, targets)
  optimizer1.zero_grad()
  loss1.backward()
  optimizer1.step()
     
  # model2
  pred2 = model2(data)
  loss2 = loss_fn2(pred2, targets)
  kl_loss = divergence_loss_fn(
    F.softmax(pred1/t, dim=1),
    F.softmax(pred2/t, dim=1)
  )
  loss = (1-alpha) * loss2 + alpha * kl_loss
  optimizer2.zero_grad()
  loss.backward()
  optimizer2.step()

If I run this code as it is, I get the following error:

RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved variables after calling backward.

If I change the first backward call to loss1.backward(retain_graph=True), I get the following error instead:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048, 10]], which is output 0 of TBackward, is at version 1565; expected version 1564 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Your help would be much appreciated.

ptrblck · March 29, 2022, 5:20am

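In your current code snippet, kl_loss is computed from both pred1 and pred2, so loss.backward() will try to calculate the gradients for both models.
Since loss1.backward() and optimizer1.step() were already performed, this is invalid and raises the errors: loss1.backward() has already freed the graph of model1, and with retain_graph=True the retained graph still refers to parameters that optimizer1.step() has since updated in place.
Assuming you don’t want to train model1 anymore, you could detach() pred1 before passing it to divergence_loss_fn.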
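
For reference, here is a minimal sketch of the training step with pred1 detached. The linear models, SGD optimizers, random data, the values of t and alpha, and the hand-written divergence_loss_fn are all hypothetical stand-ins chosen so the snippet runs on its own; they are not taken from the original post, and the real divergence_loss_fn may expect different inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the objects used in the question.
model1 = nn.Linear(20, 10)
model2 = nn.Linear(20, 10)
optimizer1 = torch.optim.SGD(model1.parameters(), lr=0.1)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
loss_fn1 = nn.CrossEntropyLoss()
loss_fn2 = nn.CrossEntropyLoss()
t, alpha = 2.0, 0.5  # assumed temperature and mixing weight


def divergence_loss_fn(p, q, eps=1e-8):
    # Assumed definition: KL(p || q) between two batches of probabilities,
    # summed over classes and averaged over the batch.
    log_p = torch.log(torch.clamp(p, min=eps))
    log_q = torch.log(torch.clamp(q, min=eps))
    return (p * (log_p - log_q)).sum(dim=1).mean()


data = torch.randn(8, 20)
targets = torch.randint(0, 10, (8,))

# model1: ordinary update; backward() frees model1's graph afterwards.
pred1 = model1(data)
loss1 = loss_fn1(pred1, targets)
optimizer1.zero_grad()
loss1.backward()
optimizer1.step()

# Snapshots, only used to check which model the second step changes.
w1_after_step = model1.weight.detach().clone()
w2_before = model2.weight.detach().clone()

# model2: pred1.detach() is a constant for autograd, so loss.backward()
# only walks model2's graph and never touches model1's freed graph.
pred2 = model2(data)
loss2 = loss_fn2(pred2, targets)
kl_loss = divergence_loss_fn(
    F.softmax(pred1.detach() / t, dim=1),
    F.softmax(pred2 / t, dim=1),
)
loss = (1 - alpha) * loss2 + alpha * kl_loss
optimizer2.zero_grad()
loss.backward()
optimizer2.step()
```

With this change neither retain_graph=True nor anomaly detection should be needed, because the second backward pass no longer reaches model1 at all. If model1 were supposed to receive gradients from kl_loss as well, detaching would not be the right fix and the losses would have to be combined before any optimizer step.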

ageryw · March 29, 2022, 7:40pm

Thank you @ptrblck. Calling detach() on pred1 before passing it to divergence_loss_fn solved the problem.