How to learn the weights between two losses?

Rosa · July 10, 2019, 8:58am

Hi Tony-Y,
Your example is great. I am a beginner in PyTorch. I am using a multi-task approach for two different tasks and want to adopt this approach. I have a ResNet-50 as the backbone and added two branches for the two tasks. Now I want to use this multi-task loss for them. Could you briefly show how I would use your MultiTaskLoss class in my case? Below is my code:


import torch
import torch.nn as nn
import torch.nn.functional as F


class multi_output_model(torch.nn.Module):
    def __init__(self, model_core, cup_nodes, bbc_type_nodes):
        super(multi_output_model, self).__init__()

        self.resnet_model = model_core
        self.cup_nodes = cup_nodes
        self.bbc_type_nodes = bbc_type_nodes
          
           
        ''' heads ______________ '''
        self.y1o = nn.Linear(256,self.cup_nodes)
        nn.init.xavier_normal_(self.y1o.weight)
        self.y2o = nn.Linear(256,self.bbc_type_nodes)
        nn.init.xavier_normal_(self.y2o.weight)
        
    def forward(self, x):
       
        x1 = self.resnet_model(x)
        y1o = F.softmax(self.y1o(x1),dim=1)  
        y2o = torch.sigmoid(self.y2o(x1))   
        return y1o, y2o

Now I am calling this like:

model= multi_output_model(pretrainedImgnetModel,cup_nodes,bbc_type_nodes)
criterion = [nn.CrossEntropyLoss(),nn.CrossEntropyLoss()] # two loss function for two task

and during training I compute the loss like this:

loss0 = criterion[0](outputs[0], torch.max(cup.float(), 1)[1])
loss1 = criterion[1](outputs[1], torch.max(bbc_type.float(), 1)[1])
totalLoss = loss0+loss1

Thanks a lot in advance.

Tony-Y · July 10, 2019, 2:31pm

First of all, you need to check the document of nn.CrossEntropyLoss. F.softmax should not be applied to y1o because it is included in CrossEntropyLoss. In addition, you should confirm whether the application of sigmoid to y2o is appropriate.
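To make the point about nn.CrossEntropyLoss concrete, here is a minimal sketch. The logits and targets are made-up values chosen only for illustration. It checks that the loss already applies log-softmax internally, and that feeding it softmax outputs gives a different (here larger) value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores: 3 samples, 3 classes, each confidently correct.
logits = torch.tensor([[4.0, 0.0, -2.0],
                       [-1.0, 0.0, 3.0],
                       [0.0, 5.0, 1.0]])
target = torch.tensor([0, 2, 1])      # made-up class indices

criterion = nn.CrossEntropyLoss()

# Intended usage: pass raw logits.
loss_logits = criterion(logits, target)

# The same computation by hand: log-softmax, then negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Softmax applied twice: once by the caller, once inside the loss.
loss_double = criterion(F.softmax(logits, dim=1), target)

print(loss_logits.item(), loss_manual.item(), loss_double.item())
```

Because softmax outputs lie between 0 and 1, the second softmax inside the loss can no longer produce a confident distribution, so the loss stays well above zero even for correct predictions.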

Rosa · July 10, 2019, 3:16pm

Hi Tony-Y,
My branch y1o is for multi-label classification and y2o is for binary classification. So I apply sigmoid to y2o to get values between 0 and 1, and softmax to y1o so that all output probabilities sum to 1.
Please correct me if I am wrong.

Tony-Y · July 10, 2019, 3:28pm

The answer of this question is helpful. For binary classifications, there are three methods.

Rosa · July 11, 2019, 7:54am

Thanks a lot, Tony. I have read it. But for binary classification, BCELoss and CrossEntropyLoss should be the same, shouldn't they? In that case, should my code be OK?

It would be nice if you could explain a little how I can adopt MultiTaskLoss for my case.

Tony-Y · July 11, 2019, 8:04am

BCELoss is not the same as CrossEntropyLoss. Before considering multitask learning, you have to learn how to use loss functions in PyTorch.
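A small sketch of the difference, using made-up logits and labels: BCELoss expects probabilities and float targets of the same shape, BCEWithLogitsLoss expects one raw logit per sample, and CrossEntropyLoss expects one raw logit per class plus integer class indices. The values agree only when the two-class logits are constructed as [0, z].

```python
import torch
import torch.nn as nn

z = torch.tensor([1.5, -0.3, 0.8])    # made-up single logit per sample
y = torch.tensor([1, 0, 1])           # made-up binary labels

# One logit per sample, float targets of the same shape.
bce_logits = nn.BCEWithLogitsLoss()(z, y.float())

# BCELoss takes probabilities, so the caller applies sigmoid.
bce_prob = nn.BCELoss()(torch.sigmoid(z), y.float())

# CrossEntropyLoss takes one logit per class and integer class indices.
two_class_logits = torch.stack([torch.zeros_like(z), z], dim=1)  # shape (3, 2)
ce = nn.CrossEntropyLoss()(two_class_logits, y)

print(bce_logits.item(), bce_prob.item(), ce.item())
```

The three numbers match here because softmax over [0, z] gives sigmoid(z) for class 1. Passing sigmoid outputs to CrossEntropyLoss, as in the code at the top of the thread, is a different computation.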

as1949078 · July 13, 2019, 8:03am

Have you figured out why losses do not change? I have the same problem: losses keep stable while values of sigma are changing.

as1949078 · July 17, 2019, 7:09am

I think we should not use torch.nn.XX or torch.nn.functional.XX to get losses in the forward function. For those who are stuck here because losses do not change, I have reimplemented the example from the author of the paper using PyTorch: PyTorch Example.

ywatanabe1989 · August 9, 2019, 1:05pm

github.com

ywatanabe1989/custom_losses_pytorch/blob/master/multi_task_loss.py

import torch

class MultiTaskLoss(torch.nn.Module):
  def __init__(self, is_regression, reduction='none'):
    super(MultiTaskLoss, self).__init__()
    self.is_regression = is_regression
    self.n_tasks = len(is_regression)
    self.log_vars = torch.nn.Parameter(torch.zeros(self.n_tasks))
    self.reduction = reduction

  def forward(self, losses):
    dtype = losses.dtype
    device = losses.device
    stds = (torch.exp(self.log_vars)**(1/2)).to(device).to(dtype)
    self.is_regression = self.is_regression.to(device).to(dtype)
    coeffs = 1 / ( (self.is_regression+1)*(stds**2) )
    multi_task_losses = coeffs*losses + torch.log(stds)

    if self.reduction == 'sum':
      multi_task_losses = multi_task_losses.sum()

This file has been truncated.

I wrote some example code and it seems to work.
The key might be to make the optimizer recognize the learnable parameters (the multi-task loss’s sigmas).
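A minimal sketch of what registering the sigmas with the optimizer could look like. The UncertaintyWeights class, the nn.Linear stand-in model, and the random data are all hypothetical and are not the truncated file above. For brevity the sketch uses the 1/sigma^2 weighting for every task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class UncertaintyWeights(nn.Module):
    """Hypothetical helper holding one learnable log-variance per task."""
    def __init__(self, n_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):
        # With s = log(sigma^2): L / sigma^2 + log(sigma) = exp(-s) * L + s / 2
        return (torch.exp(-self.log_vars) * losses + 0.5 * self.log_vars).sum()

model = nn.Linear(8, 2)                   # stand-in for the real network
weighting = UncertaintyWeights(n_tasks=2)

# Both parameter sets go to the optimizer; otherwise log_vars is never updated.
optimizer = torch.optim.SGD(
    list(model.parameters()) + list(weighting.parameters()), lr=0.1
)

x, target = torch.randn(16, 8), torch.randn(16, 2)
out = model(x)
# torch.stack keeps each loss attached to the autograd graph.
losses = torch.stack([
    nn.functional.mse_loss(out[:, 0], target[:, 0]),
    nn.functional.mse_loss(out[:, 1], target[:, 1]),
])

before = weighting.log_vars.detach().clone()
optimizer.zero_grad()
weighting(losses).backward()
optimizer.step()
print(before, weighting.log_vars.detach())
```

If the optimizer were built from model.parameters() alone, log_vars would still receive gradients but would stay at its initial value.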

Rosa · August 16, 2019, 7:08am

Hi,
You mentioned the usage as:

usage
is_regression = torch.Tensor([True, True, False]) # True: Regression/MeanSquaredErrorLoss, False: Classification/CrossEntropyLoss

multitaskloss_instance = MultiTaskLoss(is_regression)

So for a classification problem I should set
is_regression = False
Could you clarify that a bit?

ywatanabe1989 · August 16, 2019, 9:49am

Thank you for your comment.

Yes, that’s right.
If you have loss1, loss2, and loss3, which are cross entropy loss, cross entropy loss, and MSE loss respectively, you should pass “is_regression = torch.Tensor([False, False, True])” to the constructor, following the convention in the usage comment (True: regression, False: classification).

I’d like to hear whether this multi task loss implementation works in your setting, too.

Isaac_Kargar · January 28, 2020, 12:03pm

Can I ask where this equation is from? I cannot find it in the paper.

Tony-Y · January 28, 2020, 12:42pm

The equation is derived from the last paragraph of the section “3.2. Multi-task likelihoods”.
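For reference, and assuming the paper under discussion is Kendall, Gal and Cipolla's "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics", the combined objective for one regression task (loss L_1) and one classification task (loss L_2) given at the end of that section can be written as:

```latex
\mathcal{L}(W, \sigma_1, \sigma_2) \approx
  \frac{1}{2\sigma_1^2}\,\mathcal{L}_1(W)
  + \frac{1}{\sigma_2^2}\,\mathcal{L}_2(W)
  + \log \sigma_1 + \log \sigma_2
```

The extra factor of 2 on the regression term corresponds to the (is_regression + 1) factor in the coefficient of the MultiTaskLoss code posted earlier in the thread.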

Stefano_Savian · March 3, 2020, 10:50am

Dear all,
thanks for your hints on the MTL loss. I have been experimenting with the MTL loss for a while now and I am facing some problems. I am doing image-to-image regression with two different losses. After trying (many) different things, I found that initializing the sigmas to one gives the best tradeoff between performance across the two tasks.

The problem is that one sigma is increasing and the other is decreasing. Can this be considered normal? Since my experiment is GPU intensive, I am experimenting with a short training schedule for prototyping. However, the increasing sigma leads to much worse performance when I try the full training schedule, giving completely different (very bad) results.

Thanks for your help!
Stefano

glenn.jocher · March 21, 2020, 7:01am

I tried implementing @Tony-Y’s MultiTaskLoss() example above to balance the 3 loss terms in our https://github.com/ultralytics/yolov3 repo, but no luck. The total loss did reduce, but 90% of the loss component was made up of the self.eta constants. Our 3 balance parameters ended up between 1.0 and 2.0. One thing I realized may be missing is a constraint that all loss weights sum to 1 for example, but I did not explore further.

We have used hyperparameter evolution to successfully balance our loss terms (at great GPU expense), but this evolves to a specific task (i.e. YOLOv3 COCO with 80 classes), making it a suboptimal solution for loss balancing for people adopting the repo for their custom datasets.

    total_loss = torch.Tensor(loss) * torch.exp(-self.eta) + self.eta
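As a rough worked example of how the eta term behaves in this formula, the sketch below treats a single loss as a fixed positive constant (an assumption, since the loss changes during training) and finds the eta that minimizes loss * exp(-eta) + eta. The loss magnitudes are made up.

```python
import math

def weighted(loss, eta):
    # One term of the formula above: loss * exp(-eta) + eta
    return loss * math.exp(-eta) + eta

def best_eta(loss):
    # Setting the derivative -loss * exp(-eta) + 1 to zero gives eta = log(loss).
    return math.log(loss)

for loss in (3.0, 5.0, 7.0):              # made-up loss magnitudes
    eta = best_eta(loss)
    total = weighted(loss, eta)           # equals 1 + log(loss) at the optimum
    print(f"loss={loss}: eta={eta:.3f}, scaled loss={loss * math.exp(-eta):.3f}, total={total:.3f}")
```

Under this simplification the scaled loss term settles at 1 regardless of the raw loss, and everything else in the total is eta itself. A loss between roughly 2.7 and 7.4 would correspond to an eta between 1 and 2.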

ZdsAlpha · March 21, 2020, 4:35pm

Do not want to complicate things? Accumulate gradients.

loss1.backward()
loss2.backward()
optimizer.step()

hadaev8 · March 28, 2020, 10:21am

Interesting idea, but

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
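This error appears when both losses share part of the graph (for example a common backbone), because the first backward call frees the shared buffers. A minimal sketch of two ways around it, using hypothetical stand-in layers and random input: keep the graph for the second call, or sum the losses and call backward once.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(4, 3)                # stand-in for a shared backbone
head1, head2 = nn.Linear(3, 1), nn.Linear(3, 1)
x = torch.randn(5, 4)

def two_losses():
    features = backbone(x)                # shared part of the graph
    return head1(features).pow(2).mean(), head2(features).abs().mean()

# Option 1: keep the shared graph alive for the second backward pass.
loss1, loss2 = two_losses()
loss1.backward(retain_graph=True)
loss2.backward()                          # gradients accumulate in .grad
grad_separate = backbone.weight.grad.clone()

# Option 2: sum the losses and call backward once.
backbone.zero_grad()
loss1, loss2 = two_losses()
(loss1 + loss2).backward()
grad_summed = backbone.weight.grad.clone()

print(torch.allclose(grad_separate, grad_summed, atol=1e-6))
```

Both options give the same backbone gradients as an unweighted sum, so accumulating gradients this way does not by itself learn any weighting between the two losses.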

stu17682 · July 24, 2020, 5:35pm

Hey Stefano - it probably cannot be considered normal if you are getting very bad results with the full training schedule. Did you try using sigma and 1-sigma instead, so that there is only one learnable parameter instead of two separate sigmas?
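One possible reading of this suggestion, sketched under assumptions: a single learnable parameter is squashed into a weight w between 0 and 1, and the total is w * loss1 + (1 - w) * loss2. The ConvexTaskWeight class and the loss values are hypothetical.

```python
import torch
import torch.nn as nn

class ConvexTaskWeight(nn.Module):
    """Hypothetical helper: total = w * loss1 + (1 - w) * loss2, with w in (0, 1)."""
    def __init__(self):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(()))   # sigmoid(0) = 0.5 at the start

    def forward(self, loss1, loss2):
        w = torch.sigmoid(self.raw)                # keeps the weight between 0 and 1
        return w * loss1 + (1.0 - w) * loss2

weighting = ConvexTaskWeight()
loss1 = torch.tensor(2.0)                          # made-up loss values
loss2 = torch.tensor(0.5)
total = weighting(loss1, loss2)
total.backward()
print(total.item(), weighting.raw.grad.item())
```

Note that the gradient with respect to w is loss1 - loss2, so plain gradient descent on this total moves the weight toward whichever task currently has the smaller loss. Whether that behaviour is desirable here is an open question.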

Stefano_Savian · July 27, 2020, 2:56pm

Thanks, Stuart, for your suggestion!
I haven’t tried exactly that, but I will give it a try! After some time, what I noticed is that I was getting better results with the “short” prototyping schedule because both variances were slightly increasing (thus decreasing both losses). As a result I got better (almost acceptable) results while prototyping, but bad results with the full schedule.

My guess is that my two tasks are correlated and strongly depend on the network's bootstrap phase of training, somehow invalidating the hypotheses of the MTL paper. Maybe task independence is necessary for the log-likelihood estimation? I am just guessing.

stu17682 · July 27, 2020, 3:39pm

Yeah, give that a go. It could sort the problem, and I am interested to hear if it does! If you think the two tasks are not independent, it might be possible to test that theory. If they are very tightly linked, perhaps the internal learned feature representations (or even the initial raw inputs) for one task could be good for the other. I don’t know all the details of your use case, but if you tried to use features from one task with the targets from your other task, that could give useful insight.