How can I use different losses to update different branches and sum their gradients to update the shared (main) branch?

I am working on a multi-label problem. My network has two branches that split off after a shared stack of conv layers.
Here is a sketch of my network:

import torch.nn as nn


class MultiLabelDemo(nn.Module):
    def __init__(self):
        super(MultiLabelDemo, self).__init__()
        # shared trunk: both labels are predicted from these features
        self.main_block = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=96, out_channels=96, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=96, out_channels=96, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
        )
        # tail for the first label
        self.tail_block1 = nn.Sequential(
            nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
        )
        # tail for the second label
        self.tail_block2 = nn.Sequential(
            nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
        )

    def forward(self, data):
        x = self.main_block(data)    # shared features
        y1 = self.tail_block1(x)     # prediction for label 1
        y2 = self.tail_block2(x)     # prediction for label 2
        return y1, y2


branch_1 (tail_block1) predicts the first label and branch_2 (tail_block2) the second, so loss_1 is computed as nn.CrossEntropyLoss()(y1, label_1) and loss_2 as nn.CrossEntropyLoss()(y2, label_2).
The whole network has only one optimizer (SGD). (I'm not sure whether that is correct.)

What I want to implement is that loss_1 updates the weights and biases in tail_block1 and loss_2 updates those in tail_block2; when backpropagation reaches the layer where the network splits, the two gradients should be added together to update the shared main_block. How can I do this? I've searched a lot but found nothing. I'd appreciate your help.


Your approach would work out of the box in PyTorch.
You could just add both of your losses and call backward on the sum.
This will make sure that the tails of your model will use the appropriate gradients, while the common part uses the accumulated gradient.
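
Roughly like this (a minimal sketch using the MultiLabelDemo module above; the dummy inputs and per-pixel targets are just placeholders for your real data):

import torch
import torch.nn as nn
import torch.optim as optim

model = MultiLabelDemo()
criterion1 = nn.CrossEntropyLoss()
criterion2 = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(4, 3, 227, 227)            # dummy input batch
label_1 = torch.randint(0, 256, (4, 13, 13))  # dummy target for tail_block1
label_2 = torch.randint(0, 256, (4, 13, 13))  # dummy target for tail_block2

optimizer.zero_grad()
y1, y2 = model(data)                          # shared main_block feeds both tails
loss = criterion1(y1, label_1) + criterion2(y2, label_2)
loss.backward()                               # loss1's gradients reach only tail_block1,
                                              # loss2's only tail_block2, and main_block
                                              # accumulates both
optimizer.step()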


I've seen several similar problems, e.g. https://discuss.pytorch.org/t/how-to-train-the-network-with-multiple-branches/2152, so I'm not convinced it works out of the box in PyTorch, and I would like to double-check whether it works.

Sure, it’s always good to double check an approach.
In your use case you don’t have to modify anything in a fancy way, but rather just add the different losses together as shown in the thread you’ve linked.

If you are using different criteria for both tails, I would try to make sure the losses have approx. the same range, e.g. by scaling them.
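
For example (the scaling factors here are just placeholders you would tune so that both losses end up in a similar range):

w1, w2 = 1.0, 0.5          # hypothetical loss weights
loss = w1 * loss1 + w2 * loss2
loss.backward()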

:grinning:
Thanks for your kind reply.
I am putting all of my losses into a Python list and using torch.autograd.backward() to do backpropagation. I am running several tests to see if it is correct.
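
Roughly like this (a sketch; loss1 and loss2 are the two scalar losses from the tails, and optimizer is the SGD optimizer from before):

import torch

# accumulates the gradients of both losses, just like (loss1 + loss2).backward()
torch.autograd.backward([loss1, loss2])
optimizer.step()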

I'm not sure if you will run into an error with that, but if you want to call backward sequentially for your losses, you might want to specify retain_graph=True in the .backward() call, as the intermediate buffers will be freed otherwise.

If you call backward() on each loss separately without it, you will for sure encounter this error. The easier way would be to sum up the losses. Both approaches are equivalent, since the gradients of multiple backward calls are accumulated.
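
For example, these two alternatives end up with the same gradients (pick one; only the first sequential call needs retain_graph=True):

# Option 1 (preferred): sum the losses and call backward once
(loss1 + loss2).backward()

# Option 2: sequential backward calls; retain_graph=True keeps the shared graph
# of main_block alive until the second call
loss1.backward(retain_graph=True)
loss2.backward()

In both cases the gradients in main_block are the sum of both contributions.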


In fact I am pretty new to PyTorch, since I've used Caffe for several years. :joy:
I am reading the PyTorch documentation again and again. :joy:

Thanks for your suggestion. I'm going to check my code.

Just add all the losses together and call .backward() once! Do NOT call loss1.backward() and loss2.backward() separately, since the graph of main_block is freed after the first call unless you pass retain_graph=True.
A good explanation of retain_graph=True is here: https://github.com/pytorch/tutorials/tree/master/advanced_source

Based on your reply, is the following code right?

optimizer = optim.SGD(params=my_params_list, lr=....)

loss1_func = nn.CrossEntropyLoss()
loss2_func = nn.CrossEntropyLoss()

loss1 = loss1_func(y1, target1)
loss2 = loss2_func(y2, target2)

loss = sum([loss1, loss2])  # or loss = loss1 + loss2
loss.backward()
optimizer.step()

from previous reply, I’ve changed my code into:

optimizer = optim.SGD(params=my_params_list, lr=....)

loss1_func = nn.CrossEntropyLoss()
loss2_func = nn.CrossEntropyLoss()

loss1 = loss1_func(y1, target1)
loss2 = loss2_func(y2, target2)

loss1.backward(retain_graph=True)
loss2.backward(retain_graph=True)

optimizer.step()

Thanks for your suggestion.


The first is elegant and correct; loss = loss1 + loss2 is fine.
The second way may consume more memory: loss1.backward(retain_graph=True) makes the whole graph (including main_block) keep its intermediate buffers until the second backward call.

Thank you very much for your help. :hugs::hugs::hugs:

Note that you should zero the gradients before calling backward via optimizer.zero_grad().
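
For example, a minimal ordering inside the training loop (train_loader is a placeholder for your dataloader; the other names are from your snippet above):

for data, target1, target2 in train_loader:   # hypothetical dataloader
    optimizer.zero_grad()                     # clear gradients from the previous iteration
    y1, y2 = model(data)
    loss = loss1_func(y1, target1) + loss2_func(y2, target2)
    loss.backward()
    optimizer.step()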


Yeah, I call optimizer.zero_grad() at the beginning of my training loop.

You should do this in every iteration before calling backward (unless you intentionally want to accumulate gradients, e.g. in some RNN setups). Otherwise the gradients of all losses will keep accumulating across iterations.
