I am trying to apply multi-task training to my own model (implementation: https://github.com/Hui-Li/multi-task-learning-example-PyTorch/blob/master/multi-task-learning-example-PyTorch.ipynb), and I have run into some issues with combining the losses.
I have three models to train: Model_1 and Model_2 are trained with nn.CrossEntropyLoss() and Model_3 is trained with nn.BCELoss().
The training goes like this:
loss = loss_1 + loss_2 + loss_3
loss = torch.mean(loss)
But I found that Model_2 cannot be trained properly, as loss_2 hardly decreases. However, when I train Model_2 separately with the identical model and data, the loss decreases normally and the model learns quite well.
What I want to know is how exactly loss.backward() works in this situation. I also tried manually giving loss_2 a large weight, or setting loss_1 = loss_3 = 0, but neither helps.
Can someone please explain a little bit about that? Thanks a lot. : )
Which model parameters have you fed into the optimizer?
Model_1, Model_2 or Model_3?
The parameters of all three models are fed into the optimizer; the trained model is actually a “LossWrapper” instance. My code works generally the same way as in the link: https://github.com/Hui-Li/multi-task-learning-example-PyTorch/blob/master/multi-task-learning-example-PyTorch.ipynb
precision1 = torch.exp(-self.log_vars)
loss = torch.sum(precision1 * (targets - outputs) ** 2. + self.log_vars, -1)
precision2 = torch.exp(-self.log_vars)
loss += torch.sum(precision2 * (targets - outputs) ** 2. + self.log_vars, -1)
This seems to be summing the (negative) log-likelihoods of two normal distributions. But as their precisions differ, their densities are scaled differently and are thus not comparable. That loss works though, at least with no shared parameters; you just have to examine the partial losses separately.
Now, with classification losses, the underlying distributions have no precision parameters, so a simple unweighted sum of losses should work. Indeed, trainable loss weights may easily screw things up. Again, shared parameters may impede training, but otherwise, for loss = loss1+loss2, autograd.backward([loss]) is the same as autograd.backward([loss1,loss2]) (i.e. independent backprops), as SumBackward doesn’t change gradients (and MeanBackward just divides gradients by a constant).
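To illustrate the last point, here is a minimal sketch, not taken from the original code. The two tiny Linear models, the random data and the helper names (compute_losses, collect_grads, clear_grads) are all made up for the example, and the models share no parameters. It compares the gradients from backpropagating the summed loss with those from backpropagating the two losses independently.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for two of the models; they share no parameters.
model_a = torch.nn.Linear(4, 3)   # trained with cross-entropy
model_b = torch.nn.Linear(4, 1)   # trained with binary cross-entropy

# Random toy data, one separate input per model.
x_a, y_a = torch.randn(8, 4), torch.randint(0, 3, (8,))
x_b, y_b = torch.randn(8, 4), torch.rand(8, 1)

def compute_losses():
    loss_a = F.cross_entropy(model_a(x_a), y_a)
    loss_b = F.binary_cross_entropy(torch.sigmoid(model_b(x_b)), y_b)
    return loss_a, loss_b

def collect_grads():
    # Copy the current gradients of every parameter of both models.
    return [p.grad.clone() for m in (model_a, model_b) for p in m.parameters()]

def clear_grads():
    for m in (model_a, model_b):
        m.zero_grad()

# Variant 1: backprop through the summed loss.
clear_grads()
loss_a, loss_b = compute_losses()
(loss_a + loss_b).backward()
summed = collect_grads()

# Variant 2: backprop the two losses independently.
clear_grads()
loss_a, loss_b = compute_losses()
torch.autograd.backward([loss_a, loss_b])
separate = collect_grads()

same = all(torch.allclose(g1, g2) for g1, g2 in zip(summed, separate))
print("gradients match:", same)
```

If the gradients match here but a model still trains differently in the combined setup, the difference is more likely in the inputs, targets or optimizer configuration than in backward() itself.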
Thanks for the reply. When training my demo, I actually did not add the ‘precision’ parameter, and the total loss is torch.mean(torch.sum(loss_1 + loss_2 + loss_3)). And the training for loss_2 still doesn’t work.
My code differs in some ways from the code in the link (Multi-Task): 1. My loss wrapper receives three separate inputs and produces three separate outputs and losses, while the model in the link shares one common input. 2. I found that loss_2 has a different scale than loss_1 and loss_3; it is much smaller.
However, if the problem were 1 (multiple inputs), loss_1 and loss_3 should be affected too, yet they train normally; and if the problem were 2 (scale), setting loss_1 = loss_3 = 0 or simply dropping them should help, yet it does not.
I am still confused and may need to look into more details of “loss.backward()”.
That doesn’t sound like a backward() issue. Make sure you use the same inputs & targets as when training model2 separately. If the first epoch’s loss_2 is the same (with the same parameters and inputs), investigate why the optimizer does different things.
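One possible way to investigate is to log the gradient norm of each model right after backward(), in both the combined and the separate run, and compare them. The sketch below is only an illustration under assumed names: grad_norms is a hypothetical helper, and the Linear models and random data stand in for the real ones.

```python
import torch
import torch.nn.functional as F

def grad_norms(named_models):
    """Return the total gradient L2 norm per model (call after backward())."""
    norms = {}
    for name, model in named_models.items():
        squared = 0.0
        for p in model.parameters():
            if p.grad is not None:  # parameters that got no gradient are skipped
                squared += p.grad.detach().pow(2).sum().item()
        norms[name] = squared ** 0.5
    return norms

torch.manual_seed(0)

# Hypothetical stand-ins for the real models and data.
models = {
    "model_1": torch.nn.Linear(4, 3),
    "model_2": torch.nn.Linear(4, 1),
}
x = torch.randn(8, 4)
loss_1 = F.cross_entropy(models["model_1"](x), torch.randint(0, 3, (8,)))
loss_2 = F.binary_cross_entropy(torch.sigmoid(models["model_2"](x)), torch.rand(8, 1))

(loss_1 + loss_2).backward()
norms = grad_norms(models)
print(norms)
```

A norm of zero (or one far smaller than in the separate run) for one model would point to that model's parameters not receiving the expected gradients, for example because they are missing from the graph or the loss is saturated.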
Thanks a lot. It was a problem with model_2. I changed the loss of model 2 from BCELoss to MSELoss, and now the multi-task model can be trained normally.