How to learn the weights between two losses?

I am reproducing the paper "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics". The loss function is defined as

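$$\mathcal{L}(W, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2}\,\mathcal{L}_1(W) + \frac{1}{2\sigma_2^2}\,\mathcal{L}_2(W) + \log(\sigma_1 \sigma_2)$$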

This means that W and σ are the learned parameters of the network. W are the network weights, while σ is used to compute the weight of each task loss and also to regularize that weight.

It is easy to implement L1 and L2 (assume both are L1 losses):

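import torch
import torch.nn as nn
loss1 = nn.L1Loss()
loss2 = nn.L1Loss()
input1 = torch.randn(3, 5, requires_grad=True)
input2 = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
sigma1, sigma2 = torch.tensor(1.0), torch.tensor(1.0)  # fixed placeholders, not learned yet
loss_total = (1 / (2 * sigma1**2) * loss1(input1, target)
              + 1 / (2 * sigma2**2) * loss2(input2, target)
              + torch.log(sigma1 * sigma2))  # ** is the power operator; ^ would be XOR
loss_total.backward()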

However, σ must also be learned. How can I make σ learnable in the combined loss?

The following code can learn the loss weights sigma. nn.Parameter is used for adding a tensor to the parameter list of the module.

import torch
import torch.nn as nn
import torch.optim as optim

class MultiTaskLoss(nn.Module):
    def __init__(self, tasks):
        super(MultiTaskLoss, self).__init__()
        self.tasks = nn.ModuleList(tasks)
        self.sigma = nn.Parameter(torch.ones(len(tasks)))
        self.mse = nn.MSELoss()

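    def forward(self, x, targets):
        # squeeze the (N, 1) prediction to (N,) so it matches the target shape
        l = [self.mse(f(x).squeeze(-1), y) for y, f in zip(targets, self.tasks)]
        # torch.stack keeps the autograd graph; torch.Tensor(l) would detach it
        l = 0.5 * torch.stack(l) / self.sigma**2
        l = l.sum() + torch.log(self.sigma.prod())
        return l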

f1 = nn.Linear(5, 1, bias=False)
f2 = nn.Linear(5, 1, bias=False)

x = torch.randn(3, 5)
y1 = torch.randn(3)
y2 = torch.randn(3)

mtl = MultiTaskLoss([f1, f2])

print(list(mtl.parameters()))

optimizer = optim.SGD(mtl.parameters(), lr = 0.1)
optimizer.zero_grad()
mtl(x, [y1, y2]).backward()
optimizer.step()

output:

[Parameter containing:
tensor([1., 1.], requires_grad=True), Parameter containing:
tensor([[-0.4190,  0.1006,  0.1092, -0.1402,  0.1945]], requires_grad=True), Parameter containing:
tensor([[ 0.3770, -0.4309, -0.1455,  0.1247,  0.0380]], requires_grad=True)]

Hi Tony,
Thank you for the multi-task loss code for 2 tasks. If I want it to work for 3 or more tasks, is the formula:
1/(3*sigma1^2)*loss1(input1, target) + 1/(3*sigma2^2)*loss2(input2, target) + 1/(3*sigma3^2)*loss3(input3, target) + log(sigma1*sigma2*sigma3)
Is that right?

No.

Please see Equation (5) in the paper. The loss is the negated sum of three terms, one per task, each of the form given in Equation (5).
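For reference, a sketch of how the two-task loss at the top of this thread extends to three tasks. It assumes every task is a regression whose log-likelihood has the Gaussian form -1/(2σ²)‖y - f(x)‖² - log σ, which is the form the two-task loss is built from. The index i and the symbols L_i are notation introduced here for illustration.

```latex
\mathcal{L}(W, \sigma_1, \sigma_2, \sigma_3)
  = \sum_{i=1}^{3} \left( \frac{1}{2\sigma_i^2}\,\mathcal{L}_i(W) + \log\sigma_i \right)
  = \sum_{i=1}^{3} \frac{1}{2\sigma_i^2}\,\mathcal{L}_i(W) + \log(\sigma_1 \sigma_2 \sigma_3)
```

Under that assumption the factor in front of each task loss stays 1/2 and does not become 1/3; only the number of summed terms changes.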

Thanks for the reply.
I tried your code for a multi-task loss with 2 tasks, but my model only changes the sigma parameter while the loss stays the same. I don't know what is wrong with my code.
I did not write a separate loss class; I wrote the loss as a method, as in the code below.

class ModelTTestMultitask(nn.Module):
    def __init__(self, cf):
        super().__init__()  # must run before registering nn.Parameter
        self.sigma = nn.Parameter(torch.ones(2))

    def forward(self):
        do_something()

    def loss(self, batch):
        loss_1 = self.output_layer_task_1.loss(batch.label_task_1)
        loss_2 = self.output_layer_task_2.loss(batch.label_task_2)

        loss_combine = 0.5 * torch.Tensor([loss_1, loss_2]) / self.sigma ** 2
        loss_combine = loss_combine.sum() + torch.log(self.sigma.prod())
        return loss_combine

If I don't use the uncertainty-weighted loss and instead use the loss function below, everything works normally, but the task losses are on different scales, and the final model is not as good as I hoped.

    def loss(self, batch):
        loss_1 = self.output_layer_task_1.loss(batch.label_task_1)
        loss_2 = self.output_layer_task_2.loss(batch.label_task_2)

        loss_combine = 0.5 * (loss_1 + loss_2)
        return loss_combine

Didn't your experiment show a result like Figure 7 in the paper?

My task is not the same as in the paper. I am trying multi-task learning for an NLP project, NER and POS tagging. But the loss of each task does not decrease; only the value of sigma changes. Sigma becomes very large, like [121, -12] (I don't remember exactly), and it keeps changing while the loss does not change at all from iteration to iteration.

How about increasing the initial sigma? For example, 10, 100, 1000, etc.

I have not tried increasing the initial sigma; I only tried initializing with sigma = [1, 1] and sigma = [0.5, 0.5]. I will try it and report back later. Thank you.
Do you have any experience with this problem, or any explanation for my error?

I have no experience with this, but I think the uncertainty is very large at the initial stage.

I will check and read the paper again, thank you.
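One possible cause of the symptom described above (sigma changes while the task losses do not), offered as an assumption rather than a confirmed diagnosis: building a new tensor from the loss values, as torch.Tensor([loss_1, loss_2]) does, copies the numbers and drops their autograd history, so only sigma receives a gradient. The sketch below compares that with torch.stack. The layer f, the data, and the helper task_losses are hypothetical stand-ins, and the value copy is imitated with .item().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Linear(5, 1, bias=False)          # hypothetical stand-in for a task head
sigma = nn.Parameter(torch.ones(2))
x, y = torch.randn(3, 5), torch.randn(3)

def task_losses():
    pred = f(x).squeeze(-1)
    return [nn.functional.mse_loss(pred, y), nn.functional.l1_loss(pred, y)]

# Variant A: rebuild a tensor from plain values, so the graph back to f is cut.
values = torch.tensor([l.item() for l in task_losses()])
total_a = (0.5 * values / sigma**2).sum() + torch.log(sigma.prod())
total_a.backward()
grad_a = f.weight.grad                   # stays None: f never gets a gradient
sigma_grad_a = sigma.grad.clone()        # sigma is still updated

# Variant B: torch.stack keeps each loss attached to the graph.
sigma.grad = None
total_b = (0.5 * torch.stack(task_losses()) / sigma**2).sum() + torch.log(sigma.prod())
total_b.backward()
grad_b = f.weight.grad                   # now populated

print("variant A reaches the model:", grad_a is not None)
print("variant B reaches the model:", grad_b is not None)
```

If this is what happens in the model above, replacing the rebuilt tensor with torch.stack([loss_1, loss_2]) would let the task losses train as well.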

To avoid numerical instability, we should use a change of variable:

eta = log(sigma^2)

The new variable eta can take any value in (-oo, +oo), whereas sigma must stay positive.
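A short derivation, added as a sketch, of how the expression loss * exp(-eta) + eta used in the sample code follows from the original loss. It assumes eta denotes the log variance, eta = log(sigma^2), which is the reading consistent with that expression.

```latex
\eta = \log\sigma^2
\;\Longrightarrow\;
\frac{1}{2\sigma^2}\,\mathcal{L}(W) + \log\sigma
  = \frac{1}{2}\left( e^{-\eta}\,\mathcal{L}(W) + \eta \right)
```

Dropping the common factor 1/2 does not move the minimum, which is why the code can omit it.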

Sample code:

import torch
import torch.nn as nn
import torch.optim as optim

class MultiTaskLoss(nn.Module):
    def __init__(self, model, loss_fn, eta):
        super(MultiTaskLoss, self).__init__()
        self.model = model
        self.loss_fn = loss_fn
        self.eta = nn.Parameter(torch.Tensor(eta))

    def forward(self, input, targets):
        outputs = self.model(input)
        loss = [l(o,y).sum() for l, o, y in zip(self.loss_fn, outputs, targets)]
        # torch.stack keeps the autograd graph; torch.Tensor(loss) would detach it
        total_loss = torch.stack(loss) * torch.exp(-self.eta) + self.eta
        return loss, total_loss.sum() # omit 1/2

class MultiTaskModel(nn.Module):
    def __init__(self):
        super(MultiTaskModel, self).__init__()
        self.f1 = nn.Linear(5, 1, bias=False)
        self.f2 = nn.Linear(5, 1, bias=False)

    def forward(self, input):
        # use the argument, not the global x
        outputs = [self.f1(input).squeeze(), self.f2(input).squeeze()]
        return outputs

mtl = MultiTaskLoss(model=MultiTaskModel(),
                    loss_fn=[nn.MSELoss(), nn.MSELoss()],
                    eta=[2.0, 1.0])

print(list(mtl.parameters()))

x = torch.randn(3, 5)
y1 = torch.randn(3)
y2 = torch.randn(3)

optimizer = optim.SGD(mtl.parameters(), lr=0.1)
optimizer.zero_grad()
loss, total_loss = mtl(x, [y1, y2])
print(loss, total_loss)
total_loss.backward()
optimizer.step()

Output:

[Parameter containing:
tensor([2., 1.], requires_grad=True), Parameter containing:
tensor([[-0.0387,  0.3287,  0.2549,  0.3336,  0.0195]], requires_grad=True), Parameter containing:
tensor([[0.2908, 0.2801, 0.1108, 0.4235, 0.0308]], requires_grad=True)]
[tensor(3.3697, grad_fn=<SumBackward0>), tensor(2.1123, grad_fn=<SumBackward0>)] tensor(4.2331, grad_fn=<SumBackward0>)

I already found a Keras example of the paper with code similar to yours. But I don't know why my eta just keeps increasing.
Here are some values per epoch:

epoch 1:
list loss: [ 211.6204, 283.3055,  276.5063] and eta: [5.0511, 5.0714, 5.0698]
epoch 2:
list loss: [210.646, 281.631, 275.2699] and eta: [5.2132, 5.2701, 5.2673]
epoch 3:
list loss: [ 211.3304, 282.8942, 276.3101] and eta: [5.3005, 5.4210, 5.4148]
epoch 4:
list loss: [ 211.3207, 282.6045, 276.2361] and eta: [5.3320, 5.5211, 5.5101]

If I am not mistaken, loss_1 = torch.Tensor(loss) * torch.exp(-self.eta) = [3.3475, 3.4172, 3.4132]
and loss_2 = self.eta = [5.3320, 5.5211, 5.5101].
loss_2 > loss_1, and if eta keeps increasing, loss_2 stays greater than loss_1. So why does my eta still increase?
My code for computing the loss:

self.eta = nn.Parameter(torch.Tensor(cf['eta']))
loss_combine = torch.cuda.FloatTensor([loss_1.sum(), loss_2.sum(), loss_3.sum()]) * torch.exp(-self.eta) + self.eta
# print("loss combine: ", loss_combine)
loss_combine = loss_combine.sum()
return loss_combine

One question about the "Approx. optimal weights" mentioned in Table 5 of the paper: does it use a weighted sum of losses, meaning I use grid search to choose the weight for each loss and sum them like 1/2 * loss_1 + 1/3 * loss_2 + 1/5 * loss_3? Is that right?
And one question about the reason: when the losses are not on the same scale, why does a uniform total loss let one task converge while the others do not? In my case, with a simple uniform sum, after 200 epochs loss_1 is approximately 0.5, loss_2 approximately 1.2, and loss_3 greater than 7. I searched for papers and keywords but found nothing.

Thank you

I cannot answer right away. But I think optimal weights are not used in a recent paper on natural language understanding:

Please see Algorithm 1 in this paper.

The following calculation shows that the total loss was in fact decreasing:

>>> def total_loss(loss, eta):
...     loss = torch.Tensor(loss)
...     eta = torch.Tensor(eta)
...     return (loss * torch.exp(-eta) + eta).sum()
... 
>>> total_loss([ 211.6204, 283.3055,  276.5063], [5.0511, 5.0714, 5.0698])
tensor(20.0620)
>>> total_loss([210.646, 281.631, 275.2699], [5.2132, 5.2701, 5.2673])
tensor(19.7656)
>>> total_loss([ 211.3304, 282.8942, 276.3101], [5.3005, 5.4210, 5.4148])
tensor(19.6715)
>>> total_loss([ 211.3207, 282.6045, 276.2361], [5.3320, 5.5211, 5.5101])
tensor(19.6332)

I think that the uncertainties increase in the beginning but begin to decrease after some epochs, as shown in Figure 7 of the paper. You might need to tune the learning rate.

Because sigma^2 should be close to the loss at the optimum, eta = log(sigma^2) can be estimated from the initial losses as

>>> torch.log(torch.Tensor([ 211.6204, 283.3055,  276.5063]))
tensor([5.3548, 5.6465, 5.6222])

I think the maximum of eta is somewhat greater than the estimated value.
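A minimal sketch of why eta drifts toward log(loss) when the task loss itself stays roughly constant. For a fixed loss value, the term loss * exp(-eta) + eta has derivative 1 - loss * exp(-eta) with respect to eta, which is zero at eta = log(loss). The function names, starting value, learning rate, and step count are illustrative assumptions.

```python
import math

def weighted_loss(loss, eta):
    """Per-task term used in this thread: loss * exp(-eta) + eta."""
    return loss * math.exp(-eta) + eta

def fit_eta(loss, eta=5.0, lr=0.5, steps=200):
    """Plain gradient descent on eta with the task loss held fixed."""
    for _ in range(steps):
        grad = 1.0 - loss * math.exp(-eta)   # d/d(eta) of weighted_loss
        eta -= lr * grad
    return eta

losses = [211.6204, 283.3055, 276.5063]      # epoch-1 values quoted above
fitted = [fit_eta(l) for l in losses]
targets = [math.log(l) for l in losses]

for l, e, t in zip(losses, fitted, targets):
    print(f"loss={l:.4f}  fitted eta={e:.4f}  log(loss)={t:.4f}")
```

Starting below log(loss), the gradient is negative, so eta rises until it reaches that value. This matches the direction of the eta values reported per epoch.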

Figure 2 of the paper shows the performance depends on weights. The total loss is given by Equation (1) where the sum of weights is 1.

Because I am not an expert in multi-task learning, you should open a new topic about multi-task learning on this site.

I will read the paper in more detail. Anyway, thank you Tony.

You might have to use mean(), i.e. not sum().

The paper proposes a total loss composed of MSE and cross-entropy losses. Other losses are outside the scope of its assumptions. An implementation of Equation (10), where y1 is a continuous output and y2 is a discrete output:

import torch
import torch.nn as nn
import torch.optim as optim

class MultiTaskLoss(nn.Module):
    def __init__(self, model, loss_fn, eta):
        super(MultiTaskLoss, self).__init__()
        self.model = model
        self.loss_fn = loss_fn
        self.eta = nn.Parameter(torch.Tensor(eta))

    def forward(self, input, targets):
        outputs = self.model(input)
        loss = [l(o,y) for l, o, y in zip(self.loss_fn, outputs, targets)]
        # torch.stack keeps the autograd graph; torch.Tensor(loss) would detach it
        total_loss = torch.stack(loss) * torch.exp(-self.eta) + self.eta
        return loss, total_loss.sum() # omit 1/2

class MultiTaskModel(nn.Module):
    def __init__(self):
        super(MultiTaskModel, self).__init__()
        self.e  = nn.Linear(5, 5, bias=False)
        self.f1 = nn.Linear(5, 2, bias=False)
        self.f2 = nn.Linear(5, 3, bias=False)

    def forward(self, input):
        x = self.e(input)
        outputs = [self.f1(x), self.f2(x)]
        return outputs

## For the normal distribution,
loss_fn1 = nn.MSELoss()
## For the Laplace distribution,
# loss_fn1 = nn.L1Loss()
##
## Note the original work uses the L1 loss for Instance Segmentation
## and Depth Regression, as described at page 6.
## https://arxiv.org/abs/1705.07115
##

cel = nn.CrossEntropyLoss()
def loss_fn2(x, cls):
    return 2 * cel(x, cls)

mtl = MultiTaskLoss(model=MultiTaskModel(),
                    loss_fn=[loss_fn1, loss_fn2],
                    eta=[2.0, 1.0])

print(list(mtl.parameters()))

x = torch.randn(3, 5)
y1 = torch.randn(3, 2)
y2 = torch.LongTensor([0, 2, 1])

optimizer = optim.SGD(mtl.parameters(), lr=0.1)
optimizer.zero_grad()
loss, total_loss = mtl(x, [y1, y2])
print(loss, total_loss)
total_loss.backward()
optimizer.step()
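A sketch of where the factor 2 on the cross-entropy term comes from. It assumes Equation (10) of the paper has the form in the first line below (a regression term weighted by 1/(2σ²) and a classification term weighted by 1/σ²) and that eta_i = log(sigma_i^2); the second line is the first multiplied by 2.

```latex
\mathcal{L}(W, \sigma_1, \sigma_2)
  \approx \frac{1}{2\sigma_1^2}\,\mathcal{L}_1(W) + \frac{1}{\sigma_2^2}\,\mathcal{L}_2(W)
        + \log\sigma_1 + \log\sigma_2

2\,\mathcal{L}(W, \sigma_1, \sigma_2)
  \approx e^{-\eta_1}\,\mathcal{L}_1(W) + 2\,e^{-\eta_2}\,\mathcal{L}_2(W) + \eta_1 + \eta_2,
  \qquad \eta_i = \log\sigma_i^2
```

With the overall 1/2 omitted, the regression loss enters unchanged and the cross-entropy loss enters doubled, which is what loss_fn2 does.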

The losses of all 3 of my tasks are of the same kind: each is a CrossEntropyLoss. So I think

loss_combine_tensor = torch.cuda.FloatTensor([loss_1.sum(), loss_2.sum(), loss_3.sum()]) 
or
loss_combine_tensor = torch.cuda.FloatTensor([loss_1.mean(), loss_2.mean(), loss_3.mean()])

are the same value and differ only in grad_fn (sum or mean backward). Can you tell me how each choice differs when the optimizer runs?

You need neither sum() nor mean() if you use CrossEntropyLoss() with default parameters, because it already returns a scalar. The CrossEntropyLoss must be multiplied by 2 according to Equation (10) in the paper. Sample code:

cel = nn.CrossEntropyLoss()
def loss_fn2(x, cls):
    return 2 * cel(x, cls)

Applying sum() or mean() to loss_1, loss_2, and loss_3 does not affect the optimization, because each is already a scalar.
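A small check of that statement, as a sketch with made-up logits and labels: on the scalar returned by CrossEntropyLoss with default parameters, sum() and mean() give the same value and the same gradient. The last lines show the case that does differ, the reduction='sum' constructor argument, which scales the loss by the batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3, requires_grad=True)   # hypothetical batch of 4, 3 classes
labels = torch.tensor([0, 2, 1, 1])
cel = nn.CrossEntropyLoss()                      # default reduction='mean' gives a scalar

def grad_of(reduce):
    """Gradient of reduce(loss) with respect to the logits."""
    logits.grad = None
    reduce(cel(logits, labels)).backward()
    return logits.grad.clone()

g_plain = grad_of(lambda t: t)
g_sum = grad_of(torch.sum)
g_mean = grad_of(torch.mean)
same_grads = torch.allclose(g_plain, g_sum) and torch.allclose(g_plain, g_mean)
print("sum()/mean() on the scalar change nothing:", same_grads)

# The reduction argument is different: 'sum' equals batch_size times 'mean'.
cel_sum = nn.CrossEntropyLoss(reduction="sum")
scaled = torch.allclose(cel_sum(logits, labels), 4 * cel(logits, labels))
print("reduction='sum' is 4x reduction='mean':", scaled)
```

So the choice only matters when it is made inside the loss constructor, where it changes the scale of the loss that eta has to absorb.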