How can I use different losses to update different branches and sum their gradients to update the shared (main) branch?

I am working on a multi-label problem. My network has two branches that split off after a shared stack of conv layers.
Here is a sketch of my network:

import torch.nn as nn


class MultiLabelDemo(nn.Module):
    def __init__(self):
        super(MultiLabelDemo, self).__init__()
        # shared trunk: both labels are predicted from these features
        self.main_block = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=96, out_channels=96, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=96, out_channels=96, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
        )
        # tail for the first label
        self.tail_block1 = nn.Sequential(
            nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
        )
        # tail for the second label
        self.tail_block2 = nn.Sequential(
            nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
        )

    def forward(self, data):
        x = self.main_block(data)    # shared features
        y1 = self.tail_block1(x)     # prediction for label 1
        y2 = self.tail_block2(x)     # prediction for label 2
        return y1, y2


branch_1 (tail_block1) predicts the first label and branch_2 (tail_block2) the second, so loss_1 is computed as nn.CrossEntropyLoss()(y1, label_1) and loss_2 as nn.CrossEntropyLoss()(y2, label_2).
The whole network has only one optimizer (SGD). (I'm not sure whether that is correct.)

What I want to implement is that loss_1 updates the weights and biases in tail_block1 and loss_2 updates those in tail_block2; when backpropagation reaches the layer where the network splits, the two gradients should be added together to update the shared main_block. How can I do this? I've searched a lot but found nothing. I'd appreciate your help.


Your approach would work out of the box in PyTorch.
You could just add both of your losses and call backward on the sum.
This will make sure that the tails of your model will use the appropriate gradients, while the common part uses the accumulated gradient.
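
Roughly like this (a minimal sketch using the MultiLabelDemo module above; the dummy inputs and per-pixel targets are just placeholders for your real data):

import torch
import torch.nn as nn
import torch.optim as optim

model = MultiLabelDemo()
criterion1 = nn.CrossEntropyLoss()
criterion2 = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(4, 3, 227, 227)            # dummy input batch
label_1 = torch.randint(0, 256, (4, 13, 13))  # dummy target for tail_block1
label_2 = torch.randint(0, 256, (4, 13, 13))  # dummy target for tail_block2

optimizer.zero_grad()
y1, y2 = model(data)                          # shared main_block feeds both tails
loss = criterion1(y1, label_1) + criterion2(y2, label_2)
loss.backward()                               # loss1's gradients reach only tail_block1,
                                              # loss2's only tail_block2, and main_block
                                              # accumulates both
optimizer.step()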


I've seen several similar problems, e.g. https://discuss.pytorch.org/t/how-to-train-the-network-with-multiple-branches/2152, so I'm not convinced it works out of the box in PyTorch, and I would like to double-check whether it works.

Sure, it’s always good to double check an approach.
In your use case you don’t have to modify anything in a fancy way, but rather just add the different losses together as shown in the thread you’ve linked.

If you are using different criteria for both tails, I would try to make sure the losses have approx. the same range, e.g. by scaling them.
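
For example (the scaling factors here are just placeholders you would tune so that both losses end up in a similar range):

w1, w2 = 1.0, 0.5          # hypothetical loss weights
loss = w1 * loss1 + w2 * loss2
loss.backward()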

:grinning:
Thanks for your kind reply.
I am putting all of my losses into a Python list and using torch.autograd.backward() to do backpropagation. I am running several tests to see if it is correct.
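
Roughly like this (a sketch; loss1 and loss2 are the two scalar losses from the tails, and optimizer is the SGD optimizer from before):

import torch

# accumulates the gradients of both losses, just like (loss1 + loss2).backward()
torch.autograd.backward([loss1, loss2])
optimizer.step()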

I'm not sure if you will run into an error with that, but if you want to call backward sequentially for your losses, you might want to specify retain_graph=True in the .backward() call, as the intermediate buffers will be freed otherwise.

If you call backward() on each loss separately without it, you will for sure encounter this error. The easier way would be to sum up the losses. Both approaches are equivalent, since the gradients of multiple backward calls are accumulated.
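
For example, these two alternatives end up with the same gradients (pick one; only the first sequential call needs retain_graph=True):

# Option 1 (preferred): sum the losses and call backward once
(loss1 + loss2).backward()

# Option 2: sequential backward calls; retain_graph=True keeps the shared graph
# of main_block alive until the second call
loss1.backward(retain_graph=True)
loss2.backward()

In both cases the gradients in main_block are the sum of both contributions.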


In fact I am pretty new to PyTorch, since I've used Caffe for several years. :joy:
I am reading the PyTorch documentation again and again. :joy:

Thanks for your suggestion. I'm going to check my code.

Just add all the losses together and call .backward() once! Do NOT call loss1.backward() and loss2.backward() separately, since the graph of main_block is freed after the first call unless you pass retain_graph=True.
A good explanation of retain_graph=True is here: https://github.com/pytorch/tutorials/tree/master/advanced_source

Based on your reply, is the following code right?

optimizer = optim.SGD(params=my_params_list, lr=....)

loss1_func = nn.CrossEntropyLoss()
loss2_func = nn.CrossEntropyLoss()

loss1 = loss1_func(y1, target1)
loss2 = loss2_func(y2, target2)

loss = sum([loss1, loss2])  # or loss = loss1 + loss2
loss.backward()
optimizer.step()

from previous reply, I’ve changed my code into:

optimizer = optim.SGD(params=my_params_list, lr=....)

loss1_func = nn.CrossEntropyLoss()
loss2_func = nn.CrossEntropyLoss()

loss1 = loss1_func(y1, target1)
loss2 = loss2_func(y2, target2)

loss1.backward(retain_graph=True)
loss2.backward(retain_graph=True)

optimizer.step()

Thanks for your suggestion.


The first is elegant and correct; loss = loss1 + loss2 is fine.
The second way may consume more memory: loss1.backward(retain_graph=True) makes the whole graph (including main_block) keep its intermediate buffers until the second backward call.

Thank you very much for your help. :hugs::hugs::hugs:

Note that you should zero the gradients before calling backward via optimizer.zero_grad().
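
For example, a minimal ordering inside the training loop (train_loader is a placeholder for your dataloader; the other names are from your snippet above):

for data, target1, target2 in train_loader:   # hypothetical dataloader
    optimizer.zero_grad()                     # clear gradients from the previous iteration
    y1, y2 = model(data)
    loss = loss1_func(y1, target1) + loss2_func(y2, target2)
    loss.backward()
    optimizer.step()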


Yeah, I call optimizer.zero_grad() at the beginning of my training loop.

You should do this in every iteration before calling backward (unless you intentionally want to accumulate gradients, e.g. in some RNN setups). Otherwise the gradients of all losses will keep accumulating across iterations.
