How to learn the weights between two losses?

Hi Tony-Y,
Your example is great. I am a beginner in PyTorch. I am using a multi-task approach for two different tasks and want to adopt this approach. I have a ResNet50 backbone with two branches added for the two tasks, and I want to use this multi-task loss for them. Could you briefly show how I should use your MultiTaskLoss class in my case? My code is below.


import torch
import torch.nn as nn
import torch.nn.functional as F


class multi_output_model(torch.nn.Module):
    def __init__(self, model_core, cup_nodes, bbc_type_nodes):
        super(multi_output_model, self).__init__()

        self.resnet_model = model_core
        self.cup_nodes = cup_nodes
        self.bbc_type_nodes = bbc_type_nodes

        # heads
        self.y1o = nn.Linear(256, self.cup_nodes)
        nn.init.xavier_normal_(self.y1o.weight)
        self.y2o = nn.Linear(256, self.bbc_type_nodes)
        nn.init.xavier_normal_(self.y2o.weight)

    def forward(self, x):
        x1 = self.resnet_model(x)
        y1o = F.softmax(self.y1o(x1), dim=1)
        y2o = torch.sigmoid(self.y2o(x1))
        return y1o, y2o

Now I am calling this like:

model= multi_output_model(pretrainedImgnetModel,cup_nodes,bbc_type_nodes)
criterion = [nn.CrossEntropyLoss(),nn.CrossEntropyLoss()] # two loss function for two task

and during training I was computing the losses like this:

loss0 = criterion[0](outputs[0], torch.max(cup.float(), 1)[1])
loss1 = criterion[1](outputs[1], torch.max(bbc_type.float(), 1)[1])
totalLoss = loss0+loss1

Thanks a lot in advance.

First of all, you need to check the documentation of nn.CrossEntropyLoss. F.softmax should not be applied to y1o because softmax is already included in CrossEntropyLoss. In addition, you should confirm whether applying sigmoid to y2o is appropriate.
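To make the point about softmax concrete, here is a minimal sketch with made-up logits and targets (not the poster's data). It compares nn.CrossEntropyLoss on raw logits with the explicit log_softmax + nll_loss computation, and shows that feeding softmax outputs into the loss gives a different value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw head outputs: batch of 4, 3 classes (made up)
target = torch.tensor([0, 2, 1, 2])   # class indices, not one-hot vectors

criterion = nn.CrossEntropyLoss()

# CrossEntropyLoss applies log_softmax internally, so it expects raw logits.
loss_from_logits = criterion(logits, target)
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Passing softmax outputs means softmax is effectively applied twice.
loss_double_softmax = criterion(F.softmax(logits, dim=1), target)

print(loss_from_logits.item(), manual.item(), loss_double_softmax.item())
```

In the model above, this would mean returning self.y1o(x1) directly from forward and applying softmax only when probabilities are needed for inference.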

Hi Tony-Y,
My branch y1o is for multi-label classification and y2o is for binary classification. So I apply sigmoid to y2o to get values between 0 and 1, and softmax to y1o so that all output probabilities sum to 1.
Please correct me If I am wrong.

The answer of this question is helpful. For binary classifications, there are three methods.

Thanks a lot, Tony. I have read it. But for binary classification, BCELoss and CrossEntropyLoss should be the same, shouldn't they? In that case, is my code OK?

It would be nice if you could explain a little how I can adopt MultiTaskLoss for my case.

BCELoss is not the same as CrossEntropyLoss. Before considering multitask learning, you have to learn how to use loss functions in PyTorch.
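As a rough illustration of the difference, the sketch below sets up a binary problem in three common ways. These may or may not be the exact three methods meant in the linked answer, and the tensors are made up. The losses can be made to agree numerically, but each one expects a different input: probabilities, one logit, or two logits with integer class targets.

```python
import torch
import torch.nn as nn

z = torch.tensor([[0.3], [-1.2], [2.0]])   # one logit per sample, shape (N, 1)
y = torch.tensor([1, 0, 1])                # binary labels as class indices
y_float = y.float().unsqueeze(1)           # BCE-style targets: float, same shape as z

# Option A: one logit + BCEWithLogitsLoss (sigmoid is applied inside the loss).
bce_logits = nn.BCEWithLogitsLoss()(z, y_float)

# Option B: sigmoid in the model + BCELoss (expects probabilities in [0, 1]).
bce_prob = nn.BCELoss()(torch.sigmoid(z), y_float)

# Option C: two logits + CrossEntropyLoss (softmax inside, integer targets).
# With logits [0, z], softmax(...)[:, 1] equals sigmoid(z).
two_logits = torch.cat([torch.zeros_like(z), z], dim=1)
ce = nn.CrossEntropyLoss()(two_logits, y)

print(bce_logits.item(), bce_prob.item(), ce.item())
```

The code in the question passes sigmoid outputs to nn.CrossEntropyLoss, which matches none of these three setups.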

Have you figured out why losses do not change? I have the same problem: losses keep stable while values of sigma are changing.

I think we should not use torch.nn.XX or torch.nn.functional.XX to get losses in the forward function. For those who are stuck here because the losses do not change, I have reimplemented the example from the author of the paper using PyTorch: PyTorch Example.

I wrote some example code and it seems to work.
The key might be to make the optimizer recognize the learnable parameters (the multi-task loss’s sigmas).
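A minimal sketch of that idea follows. The class UncertaintyWeightedLoss is hypothetical and is not the MultiTaskLoss class discussed in this thread. It uses the `loss * exp(-eta) + eta` form for every task, with eta standing for log(sigma^2), and the toy model and data are made up. The two things it tries to show are that the loss weights are registered as nn.Parameter and handed to the optimizer, and that the per-task losses are combined with torch.stack so they stay connected to the model. Building a fresh tensor from the list of losses would, as far as I can tell, cut that connection, which could explain sigmas that move while the task losses do not.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Hypothetical sketch, not the MultiTaskLoss class from this thread."""
    def __init__(self, n_tasks):
        super().__init__()
        # eta_i stands for log(sigma_i^2); nn.Parameter makes it learnable.
        self.eta = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):
        # losses: list of scalar tensors that still carry their autograd graph.
        stacked = torch.stack(losses)
        return (stacked * torch.exp(-self.eta) + self.eta).sum()

torch.manual_seed(0)
model = nn.Linear(8, 2)                      # toy stand-in for a two-head network
mtl = UncertaintyWeightedLoss(n_tasks=2)

# Both the model weights and the loss weights are given to the optimizer.
params = list(model.parameters()) + list(mtl.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x, t = torch.randn(16, 8), torch.randn(16, 2)
out = model(x)
losses = [((out[:, 0] - t[:, 0]) ** 2).mean(),
          ((out[:, 1] - t[:, 1]) ** 2).mean()]

weight_before = model.weight.detach().clone()
eta_before = mtl.eta.detach().clone()

optimizer.zero_grad()
mtl(losses).backward()
optimizer.step()   # updates the model weights and eta together
```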

Hi,
You mentioned the usage as:

usage
is_regression = torch.Tensor([True, True, False]) # True: Regression/MeanSquaredErrorLoss, False: Classification/CrossEntropyLoss

multitaskloss_instance = MultiTaskLoss(is_regression)

So in the case of a classification problem, I should put
is_regression = False
Can you clarify this a bit?

Thank you for your comment.

Yes, that’s right.
If you have loss1, loss2, and loss3, which are cross entropy loss, cross entropy loss, and MSE loss respectively, you should pass “is_regression = torch.Tensor([False, False, True])” to the constructor.
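For illustration only, here is one way a boolean mask like is_regression could enter the weighting, under the assumption that regression tasks are scaled by 1 / (2 * sigma^2) and classification tasks by 1 / sigma^2, with log(sigma) added for each task. The variable names and loss values are made up and this is not necessarily how MultiTaskLoss is written.

```python
import torch

# Task order assumed here: cross entropy, cross entropy, MSE.
is_regression = torch.tensor([False, False, True])
log_vars = torch.zeros(3)                   # log(sigma^2) per task, fixed here for clarity
losses = torch.tensor([0.7, 1.1, 0.2])      # hypothetical per-task loss values

# Regression tasks get 1 / (2 * sigma^2); classification tasks get 1 / sigma^2.
coeffs = torch.exp(-log_vars) / (is_regression.float() + 1.0)

# log(sigma) = 0.5 * log(sigma^2) is added for every task.
weighted = coeffs * losses + 0.5 * log_vars
total = weighted.sum()
print(coeffs, total)
```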

I’d like to hear whether this multi task loss implementation works in your setting, too.

Can I ask where is this equation from? I cannot find it in the paper

The equation is derived from the last paragraph of the section “3.2. Multi-task likelihoods”.
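For reference, the objective that paragraph arrives at for one regression task (loss L1) and one classification task (loss L2) has, as far as I recall it, the form below. Please check it against the paper. The second line is only the substitution eta_i = log(sigma_i^2), which is the parameterisation used by the `exp(-eta)` code variants in this thread.

```latex
\mathcal{L}(W, \sigma_1, \sigma_2)
  \approx \frac{1}{2\sigma_1^2}\,\mathcal{L}_1(W)
        + \frac{1}{\sigma_2^2}\,\mathcal{L}_2(W)
        + \log\sigma_1 + \log\sigma_2
  = \tfrac{1}{2} e^{-\eta_1}\,\mathcal{L}_1(W)
        + e^{-\eta_2}\,\mathcal{L}_2(W)
        + \tfrac{1}{2}\eta_1 + \tfrac{1}{2}\eta_2,
  \qquad \eta_i = \log\sigma_i^2 .
```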

Dear all,
thanks for your hints on the MTL loss. I have been experimenting with the MTL loss for a while now and I am facing some problems. I am doing image-to-image regression with two different losses. After trying (many) different things, I found that initializing the sigmas to one gives the best tradeoff in performance across the two tasks.

The problem is that one sigma is increasing and the other is decreasing. Can this be considered normal? Since my experiment is GPU intensive, I am prototyping with a short training schedule. However, the increasing sigma leads to much worse performance on the full training schedule, giving completely different (very bad) results.

Thanks for your help!
Stefano

I tried implementing @Tony-Y’s MultiTaskLoss() example above to balance the 3 loss terms in our https://github.com/ultralytics/yolov3 repo, but no luck. The total loss did reduce, but 90% of the loss component was made up of the self.eta constants. Our 3 balance parameters ended up between 1.0 and 2.0. One thing I realized may be missing is a constraint that all loss weights sum to 1 for example, but I did not explore further.

We have used hyperparameter evolution to successfully balance our loss terms (at great GPU expense), but this evolves to a specific task (i.e. YOLOv3 COCO with 80 classes), making it a suboptimal solution for loss balancing for people adopting the repo for their custom datasets.

    total_loss = torch.Tensor(loss) * torch.exp(-self.eta) + self.eta
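A small worked example may help explain why the eta terms can make up a large part of the total. For a fixed task loss, the term `loss * exp(-eta) + eta` is minimised at eta = log(loss), where it equals 1 + log(loss). The loss value below is hypothetical.

```python
import math

def weighted_term(loss, eta):
    """One task's contribution: loss * exp(-eta) + eta."""
    return loss * math.exp(-eta) + eta

loss = 5.0                               # hypothetical fixed task loss
eta_star = math.log(loss)                # where d/d(eta) = -loss * exp(-eta) + 1 = 0
best = weighted_term(loss, eta_star)     # equals 1 + log(loss)
eta_share = eta_star / best              # fraction of the term contributed by eta itself

print(f"eta* = {eta_star:.4f}, term = {best:.4f}, eta share = {eta_share:.1%}")
```

Under this simplification, at the optimum the weighted loss part is always 1 and everything else in the term is eta, so learned values of eta between 1.0 and 2.0 would correspond to task losses of roughly e^1 to e^2. This ignores that the losses themselves change during training.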

Do not want to complicate things? Accumulate gradients.

loss1.backward()
loss2.backward()
optimizer.step()

Interesting idea, but

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
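The error appears when both losses share part of the graph (for example a common backbone), because the first backward call frees the saved buffers. A minimal sketch with a toy linear model and made-up losses: either pass retain_graph=True to the first call, or sum the losses and call backward once. Both give the same gradients, and neither learns any weighting between the losses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)          # toy stand-in for a shared network
x = torch.randn(8, 4)

def two_losses():
    out = model(x)               # shared forward pass
    return out[:, 0].pow(2).mean(), out[:, 1].abs().mean()

# Variant A: two backward calls; the first must keep the shared graph alive.
model.zero_grad()
loss1, loss2 = two_losses()
loss1.backward(retain_graph=True)
loss2.backward()                 # gradients accumulate into .grad
grad_two_calls = model.weight.grad.clone()

# Variant B: sum the losses and call backward once.
model.zero_grad()
loss1, loss2 = two_losses()
(loss1 + loss2).backward()
grad_summed = model.weight.grad.clone()

print(torch.allclose(grad_two_calls, grad_summed))
```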

Hey Stefano, it probably cannot be considered normal if you are getting very bad results with the full training schedule. Did you try using sigma and 1-sigma instead, so there is only one learnable parameter instead of two separate sigmas?
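One possible reading of this suggestion, as a hypothetical sketch: a single learnable scalar alpha is squashed through a sigmoid to give a weight w in (0, 1), and the total is w * loss1 + (1 - w) * loss2. The class name and the loss values are made up.

```python
import torch
import torch.nn as nn

class ConvexTwoTaskLoss(nn.Module):
    """Hypothetical sketch: total = w * loss1 + (1 - w) * loss2, with w in (0, 1)."""
    def __init__(self):
        super().__init__()
        # w = sigmoid(alpha), so alpha = 0 starts at w = 0.5.
        self.alpha = nn.Parameter(torch.zeros(()))

    def forward(self, loss1, loss2):
        w = torch.sigmoid(self.alpha)
        return w * loss1 + (1.0 - w) * loss2

combine = ConvexTwoTaskLoss()
loss1, loss2 = torch.tensor(2.0), torch.tensor(0.5)   # hypothetical loss values
total = combine(loss1, loss2)
total.backward()

# A positive gradient means gradient descent lowers alpha, i.e. it shifts
# weight away from the larger loss (loss1 here).
print(total.item(), combine.alpha.grad.item())
```

One caveat: with nothing else constraining w, minimising this total directly tends to push the weight toward whichever loss is smaller, so some extra term or a schedule for w may be needed in practice.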

Thanks, Stuart, for your suggestion!
I haven’t tried exactly that, but I will give it a try! After some time, what I noticed is that I was getting better results with the “short” prototyping schedule because both variances were slightly increasing (thus decreasing both losses). As a result, I got better (almost acceptable) results while prototyping, but bad results on the full schedule.

My guess is that my two tasks are correlated and strongly depend on the network training bootstrap phase, somehow invalidating the hypotheses of the MTL paper. Maybe task independence is necessary for the log likelihood estimation? I am just guessing.

Yeah, give that a go, it could sort the problem, and I am interested to hear if it does! If you think the two tasks are not independent, it might be possible to test that theory. If they are very tightly linked, perhaps the internal learned feature representations (or even the initial raw inputs) for one task could be good for the other. I don’t know all the details of your use case, but if you tried to use features from one task with the targets from your other task, that could give useful insight.
