How to learn the weights between two losses?

Thank you for your comment.

Yes, that’s right.
If you have loss1, loss2, and loss3, which are cross-entropy loss, cross-entropy loss, and MSE loss respectively, you should pass “is_regression = torch.Tensor([False, False, True])” to the constructor, since only the MSE loss is a regression loss.
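
For reference, here is a minimal sketch of how such an is_regression flag typically enters this uncertainty-weighted loss. The class below is my reconstruction of the idea from Kendall et al., not the exact code from the earlier post: regression tasks get a coefficient of 1/(2·sigma²), classification tasks 1/sigma².

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # Learns one log-variance per task; regression tasks get 1/(2*sigma^2),
    # classification tasks 1/sigma^2, plus a log(sigma) regularizer each.
    def __init__(self, is_regression):
        super().__init__()
        self.is_regression = is_regression  # e.g. torch.Tensor([False, False, True])
        self.log_vars = nn.Parameter(torch.zeros(len(is_regression)))

    def forward(self, losses):
        # losses: 1-D tensor of per-task losses, e.g. torch.stack([l1, l2, l3])
        stds = torch.exp(self.log_vars / 2)
        flags = self.is_regression.to(losses)  # match dtype/device of the losses
        coeffs = 1.0 / ((flags + 1.0) * stds ** 2)
        return (coeffs * losses + torch.log(stds)).sum()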

I’d like to hear whether this multi-task loss implementation works in your setting, too.

Can I ask where this equation is from? I cannot find it in the paper.

The equation is derived from the last paragraph of the section “3.2. Multi-task likelihoods”.
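
For reference, that paragraph describes training the log variance $s_i := \log \sigma_i^2$ for numerical stability, so the combined objective becomes roughly

$$\mathcal{L}_{\text{total}} = \sum_i \left( e^{-s_i} L_i + s_i \right),$$

which is the form the code in this thread uses (with the factor of 1/2 omitted).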


Dear all,
thanks for your hints on the MTL loss. I have been experimenting with the MTL loss for a while now and I am facing some problems. I am doing image-to-image regression with two different losses. After trying (many) different things, I found that initializing the sigmas to one gives the best trade-off in performance across the two tasks.

The problem is that one sigma is increasing while the other is decreasing. Can this be considered normal? Since my experiment is GPU-intensive, I am using a short training schedule for prototyping. However, the increasing sigma leads to much worse performance on the full training schedule, with completely different (very bad) results.

Thanks for your help!
Stefano

I tried implementing @Tony-Y’s MultiTaskLoss() example above to balance the 3 loss terms in our https://github.com/ultralytics/yolov3 repo, but no luck. The total loss did decrease, but 90% of it was made up of the self.eta constants. Our 3 balance parameters ended up between 1.0 and 2.0. One thing I realized may be missing is a constraint, e.g. that all loss weights sum to 1, but I did not explore further.

We have used hyperparameter evolution to successfully balance our loss terms (at great GPU expense), but this evolves weights for one specific task (i.e. YOLOv3 on COCO with 80 classes), making it a suboptimal solution for people adopting the repo for their custom datasets.

    total_loss = torch.Tensor(loss) * torch.exp(-self.eta) + self.eta

Don’t want to complicate things? Accumulate gradients:

loss1.backward()
loss2.backward()
optimizer.step()

Interesting idea, but

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
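
If the two losses share parts of the same graph (e.g. a common backbone), the first backward() frees the intermediate buffers, hence the error. A sketch of two common workarounds:

# Option 1: keep the graph alive for the second backward pass
loss1.backward(retain_graph=True)
loss2.backward()
optimizer.step()

# Option 2: one backward over the sum (same accumulated gradients, cheaper)
(loss1 + loss2).backward()
optimizer.step()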

Hey Stefano - maybe it cannot be considered normal if you are getting very bad results on the full training schedule. Did you try using sigma and 1 - sigma instead, so there is only one learnable parameter instead of two separate sigmas?
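
In case it helps, a minimal sketch of that single-parameter variant, assuming a learnable scalar squashed into (0, 1) with a sigmoid (the class name and the squashing choice are mine):

import torch
import torch.nn as nn

class SingleSigmaLoss(nn.Module):
    # Weights two losses as sigma * loss1 + (1 - sigma) * loss2,
    # with a single learnable parameter.
    def __init__(self):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(1))  # unconstrained parameter

    def forward(self, loss1, loss2):
        sigma = torch.sigmoid(self.raw)  # keeps the weight in (0, 1)
        return sigma * loss1 + (1.0 - sigma) * loss2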


Thanks Stuart for your suggestion!
I haven’t tried exactly that, but I will give it a try! What I noticed after some time is that I was getting better results with the “short” prototyping schedule because both variances were slightly increasing (thus decreasing both losses). This gave me better (almost acceptable) results while prototyping, but bad results on the full schedule.

My guess is that my two tasks are correlated and strongly depend on the bootstrap phase of network training, somehow invalidating the hypotheses of the MTL paper. Maybe task independence is necessary for the log-likelihood estimation? I am just guessing.

Yeah, give that a go, it could sort the problem; I am interested to hear if it does! If you think the two tasks are not independent, it might be possible to test that theory. If they are very tightly linked, perhaps the internal learned feature representations (or even the initial raw inputs) for one task could be good for the other. I don’t know all the details of your use case, but if you tried to use features from one task with the targets from your other task, that could be a useful insight.


For this loss, we estimate the logarithm of the variance.
When the outputs of the network are very small (for example 1e-4), self.eta becomes negative because of the logarithm function. If |self.eta| is greater than torch.Tensor(loss) * torch.exp(-self.eta), then the total loss is negative.

What do you think about regressing very small float numbers as the output of the NN?
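
(As a concrete check, with made-up numbers: minimizing loss * exp(-eta) + eta over eta gives eta = log(loss), so for loss = 1e-4 the optimum is eta ≈ -9.2, loss * exp(-eta) = 1, and the per-task total is about 1 - 9.2 ≈ -8.2, indeed negative.)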

Please see my experiment using a linear model below.

MultiTaskLoss.ipynb · GitHub

In this experiment, I used torch.stack instead of torch.Tensor to fix the reported bug in my original code, as follows:

total_loss = torch.stack(loss) * torch.exp(-self.eta) + self.eta
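
The distinction matters because torch.Tensor(loss) copies the loss values into a new leaf tensor, detaching them from the autograd graph, while torch.stack keeps each task loss connected to the model. A minimal illustration (the loss functions, outputs, and targets are placeholders):

losses = [loss_fn1(out1, y1), loss_fn2(out2, y2)]  # 0-dim tensors with grad_fn

bad = torch.Tensor(losses)   # copies values; gradients no longer reach the model
good = torch.stack(losses)   # preserves autograd history through each loss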

The total loss and the etas should be negative when the original losses converge to zero.

Reference for the linear model: https://arxiv.org/pdf/1905.11286v2.pdf
(Section 4: Experiments With Deep Linear Networks)


Hi, I used your code but I want to put my data on CUDA. The input, target, and parameters of the multi-task loss have all been put on CUDA, but unfortunately I got an error like:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Could you please offer me some clues for checking this problem?

Did you use cuda() or to(device) as in the following?

mtl = MultiTaskLoss(model=MultiTaskModel(),
                    loss_fn=[loss_fn1, loss_fn2],
                    eta=[1.0, 1.0]).cuda()
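
Note that the inputs and targets have to be moved to the same device as well; a sketch of a training step under that assumption (the loader and the per-task target layout are placeholders):

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
mtl = mtl.to(device)

for inputs, targets in loader:
    inputs = inputs.to(device)
    targets = [t.to(device) for t in targets]  # one target tensor per task
    losses, total_loss, outputs = mtl(inputs, targets)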

Hi, I use this multi-task loss and have some questions. Here are the questions: Total loss of multi-task model
Could you please help me improve the accuracy of the classification?
is_regression = torch.Tensor([False, True]), and loss_1 is an MSE loss while loss_2 is a cross-entropy loss.

Greetings, I tested a lot of things:

I trained the models separately and it seems they work without the multi-task setup.

I also took your advice and did not use a custom initialization. It works well, but the problem with multi-task is still there.

I think it’s because of the loss function.

Currently I just add the losses together and backprop through the sum, but that doesn’t seem to work. Is there a good tutorial or way to build the loss for multi-task pretrained models?

I integrated the multi-task loss function from this thread:

But it still does not work. I can train them separately, but together they don’t work (one task’s accuracy rises while the other stays low).

I use cross entropy for both.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models
from torchvision.models import ResNet50_Weights

# device and class_length are defined elsewhere in my script
model = Resnet50_multiTaskNet().to(device)
criterion = [nn.CrossEntropyLoss(), nn.CrossEntropyLoss()]

def loss_fn1(x, cls):
    return 2 * criterion[0](x, cls)
def loss_fn2(x, cls):
    return 2 * criterion[1](x, cls)

mtl = MultiTaskLoss(model=model,
                    loss_fn=[loss_fn1, loss_fn2],
                    eta=[1.0, 1.0]).to(device)  



optimizer = optim.Adam(mtl.parameters())

class Resnet50_multiTaskNet(nn.Module):
    def __init__(self):
        super(Resnet50_multiTaskNet, self).__init__()
        
        self.model =  models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

        for param in self.model.parameters():
            param.requires_grad = False 

        self.fc_artist = nn.Linear(2048, class_length['artist']).to(device)
        self.fc_style = nn.Linear(2048, class_length['style']).to(device)

    def forward(self, x):
        x = self.model.conv1(x)
        x = self.model.bn1(x)
        x = self.model.relu(x)
        x = self.model.maxpool(x)

        x = self.model.layer1(x)
        x = self.model.layer2(x)
        x = self.model.layer3(x)
        x = self.model.layer4(x)
        x = self.model.avgpool(x)
        x = x.view(x.size(0), -1)

        x_artist = self.fc_artist(x)
        x_style = self.fc_style(x)
        return x_artist, x_style
    
# multi-task loss
class MultiTaskLoss(nn.Module):
    def __init__(self, model, loss_fn, eta) -> None:
        super(MultiTaskLoss, self).__init__()
        self.model = model
        self.loss_fn = loss_fn
        self.eta = nn.Parameter(torch.Tensor(eta))

    def forward(self, input, targets):  # returns (per-task losses, total loss, outputs)
        outputs = self.model(input)
        loss = [l(o,y) for l, o, y in zip(self.loss_fn, outputs, targets)]
        total_loss = torch.stack(loss) * torch.exp(-self.eta) + self.eta
        return loss, total_loss.sum(), outputs  # omit 1/2

Does anyone have an idea why?

" A Simple General Approach to Balance Task Difficulty in Multi-Task Learning"

This paper summaries multi-task learning methods. You should first try the direct sum approach after examining the models separately.
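
For concreteness, a direct-sum baseline with the model from the previous post might look like this (the batch variable names are assumptions):

# Direct sum: fixed, equal weights on both tasks, no learned etas
out_artist, out_style = model(images)
loss = criterion[0](out_artist, y_artist) + criterion[1](out_style, y_style)

optimizer.zero_grad()
loss.backward()
optimizer.step()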


Thank you for the paper!

Is there a built-in function for taking the minimum of the loss results? I use torch.

If you want to get the minimum value, you can use torch.min.
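
For example, if losses is the list of per-task losses from earlier:

min_loss = torch.min(torch.stack(losses))  # smallest of the task losses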


I tried your approach but I get losses like this:
tensor(7.3046, device='cuda:0', grad_fn=)
tensor(4.8561, device='cuda:0', grad_fn=)

and then only the bigger loss seems to affect the backprop, so only Task 1 gets better accuracy while Task 2 stays at a very, very low accuracy.

Direct sum approach: I just summed the losses. Could you give me an example of how to modify that?

Edit:

  1. I had a mistake in my train loop.

Now it works.

I’m going to sleep and will test it overnight.

Man, I’m so stupid: I used the prediction from task 1 also for task 2 when computing the accuracy.

Thank you for your help!