How to train the network with multiple branches

Here is an example.

import torch.nn as nn

class mm(nn.Module):
    def __init__(self):
        super(mm, self).__init__()
        self.n = nn.Linear(4, 3)    # shared layer
        self.m = nn.Linear(3, 2)    # branch 1
        self.m2 = nn.Linear(3, 4)   # branch 2

    def forward(self, input, input2):
        input_ = self.n(input)
        input2_ = self.n(input2)
        o1 = self.m(input_)
        o2 = self.m2(input2_)
        return o1, o2

I want to train m and m2 with different losses. Do I need to create two optimizers?
Please show me some examples.
Thanks

You can simply do:

model = mm()
o1, o2 = model(input1, input2)
o = o1 + o2   # combine the outputs (assuming their shapes match)
# ... compute a single loss on o

## Or you can do

l1 = loss(o1, target)
l2 = loss2(o2, target2)
torch.autograd.backward([l1, l2])
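
For completeness, here is a minimal runnable sketch of the second variant, using the mm module defined above; the random inputs, targets, loss functions, and learning rate are made up for illustration:

import torch
import torch.nn as nn

model = mm()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# made-up data matching the layer sizes of mm
input1 = torch.randn(8, 4)
input2 = torch.randn(8, 4)
target1 = torch.randn(8, 2)
target2 = torch.randn(8, 4)

loss_fn1 = nn.MSELoss()
loss_fn2 = nn.MSELoss()

optimizer.zero_grad()
o1, o2 = model(input1, input2)
l1 = loss_fn1(o1, target1)
l2 = loss_fn2(o2, target2)

# backward through both losses; gradients from both branches
# accumulate into the shared layer self.n
torch.autograd.backward([l1, l2])
optimizer.step()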

With respect to the above code, is it OK to pass the same input to two different networks, i.e.

def forward(self, input, input2):
        input_ = self.n(input)
        o1 = self.m(input_)
        o2 = self.m2(input_)
        return o1, o2

or do I need to copy the input_ Variable and then pass the copy to the m2 network?

You can pass the same input to two different networks.

If I pass the same input to different networks with different optimizers, what does the gradient for this input look like? Is it the sum of the gradients from the two branches?

Yes, it is summed. PyTorch gradients are always accumulated.
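
A quick way to see the accumulation is a toy sketch with one shared layer feeding two branches (all layer sizes and tensor names here are made up):

import torch
import torch.nn as nn

shared = nn.Linear(4, 3)
branch1 = nn.Linear(3, 2)
branch2 = nn.Linear(3, 4)
x = torch.randn(8, 4)

# gradient on the shared layer from branch 1 alone
h = shared(x)
branch1(h).sum().backward(retain_graph=True)
g1 = shared.weight.grad.clone()

# gradient on the shared layer from branch 2 alone
shared.weight.grad.zero_()
branch2(h).sum().backward()
g2 = shared.weight.grad.clone()

# backward through both without zeroing in between: the gradients add up
shared.weight.grad.zero_()
h = shared(x)
branch1(h).sum().backward(retain_graph=True)
branch2(h).sum().backward()
print(torch.allclose(shared.weight.grad, g1 + g2))  # True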

If so, how do I apply different optimizers to the two different branches? I mean, branch 1 with loss_1 and branch 2 with loss_2:

loss = 2 * loss_1 + 3 * loss_2
loss.backward()
optimizer.step()

How do I wrap the two branches into one optimizer, while this optimizer actually contains two optimization methods (Adam -> branch_1 and SGD -> branch_2)?

opt1 = optim.Adam(branch_1.parameters(), ...)
opt2 = optim.SGD(branch_2.parameters(), ...)
...
...
loss = 2 * loss_1 + 3 * loss_2
loss.backward()
opt1.step()
opt2.step()
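
Putting it together, a minimal training-step sketch with the mm module from above; the assignment of the shared layer self.n to one of the optimizers, the learning rates, the loss functions, and the loader are placeholders, not the only way to do it:

import torch
import torch.nn as nn

model = mm()

# Adam for branch 1, SGD for branch 2; the shared layer self.n is added to
# the first optimizer here (an arbitrary choice for illustration). Each
# parameter should appear in exactly one optimizer.
opt1 = torch.optim.Adam(list(model.n.parameters()) + list(model.m.parameters()), lr=1e-3)
opt2 = torch.optim.SGD(model.m2.parameters(), lr=1e-2)

loss_fn1 = nn.MSELoss()
loss_fn2 = nn.MSELoss()

for input1, input2, target1, target2 in loader:  # hypothetical data loader
    opt1.zero_grad()
    opt2.zero_grad()
    o1, o2 = model(input1, input2)
    loss = 2 * loss_fn1(o1, target1) + 3 * loss_fn2(o2, target2)
    loss.backward()   # one backward fills .grad for all parameters
    opt1.step()       # each optimizer updates only its own parameters
    opt2.step()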

Can PyTorch handle backprop to separate branches if you concatenate the output of two branches into a single linear layer and then proceed to go deeper in the network until you calculate a final output?

For example:

  • Branch_1 takes channel 1 of the input image and performs convolutions.
  • Branch_2 takes channel 2 of the input image and performs convolutions.
  • Then the outputs of each branch are converted into 1D arrays and concatenated.
  • More fully connected (linear) layers follow the concatenation.
  • Finally some output value or class values are output by the network.

(Note: This is just an example to illustrate the use of branched networks for two input images)

Many thanks in advance!

Yes, you can do that. Since the graph is built dynamically, autograd will, at the time you call torch.cat, build one combined graph out of your branch graphs and correctly handle backprop.
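
As a quick sanity check, a toy sketch (layer sizes made up) shows that both branches receive gradients after a single backward through the concatenated head:

import torch
import torch.nn as nn

branch1 = nn.Linear(5, 3)
branch2 = nn.Linear(7, 3)
head = nn.Linear(6, 1)

a = torch.randn(4, 5)
b = torch.randn(4, 7)

out = head(torch.cat((branch1(a), branch2(b)), dim=1))
out.sum().backward()

print(branch1.weight.grad is not None)  # True
print(branch2.weight.grad is not None)  # True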

That’s great news, thank you!

Is there a specific way in which to pass your batches to a branched network?

For example, say your batch had dimensions [N, C, H, W] of [67, 2, 64, 64], and your network looked like the following:

import torch

class BranchedNet(torch.nn.Module):
    
    def __init__(self):
        super().__init__()
        self.feature_extractor = torch.nn.Sequential(
            
            torch.nn.Conv2d(1,64,3,padding=1), 
            torch.nn.ReLU(),
            torch.nn.Conv2d(64,128,3,padding=1),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Conv2d(128,256,3,padding=1),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2))
            
        self.feature_extractor2 = torch.nn.Sequential(
            
            torch.nn.Conv2d(1,64,3,padding=1), 
            torch.nn.ReLU(),
            torch.nn.Conv2d(64,128,3,padding=1),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Conv2d(128,256,3,padding=1),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2))
             
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear((256*16*16)*2,264),
            torch.nn.ReLU(),      
            torch.nn.Linear(264,264),
            torch.nn.ReLU(),
            torch.nn.Linear(264,1))
        
    def forward(self, x, y):
        features = self.feature_extractor(x)
        features = features.view(x.size(0), -1)
        # use the second extractor for the second branch
        features2 = self.feature_extractor2(y)
        features2 = features2.view(y.size(0), -1)
        # torch.cat expects a sequence of tensors plus a dim
        grouped = torch.cat((features, features2), dim=1)
        output = self.classifier(grouped)

        return output

Would you have to specify what x and y are during training?

I only ask because I have attempted this with the above structures and I get an error:

TypeError: forward() missing 1 required positional argument: ‘y’

…when simply passing model(batch) during training

EDIT:

Solved by @ptrblck; see Concatenate layer output with additional input data.
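
For reference, a minimal sketch of one way to call such a model, assuming each branch takes one channel of the [N, 2, H, W] batch (the slicing below is an illustration, not necessarily the linked solution):

import torch

model = BranchedNet()
batch = torch.randn(67, 2, 64, 64)  # [N, C, H, W] example from above

# slice out the two channels (keeping the channel dimension) and pass them
# as the two positional arguments x and y
output = model(batch[:, 0:1], batch[:, 1:2])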

What if loss_1 and loss_2 come from two separate nn.Module networks and final_loss = 2*loss_1 + 3*loss_2? Would a single final_loss.backward() still compute gradients in both networks?

Hi, this might sound trivial, but how exactly are we getting branch_1 and branch_2 from a single class object? I thought that for a model object, say model, we would have only one set of parameters for both networks, i.e. model.parameters(). Sorry if this sounds trivial; I am just getting started.

Can I use this idea to handle multiple predictions or branches on a single network with different loss functions? i.e.:

outputs = model(input)  # [32, 7, 256, 256]
o1 = outputs[:, [0,1], :,:] # [32, 2, 256, 256]
o2 = outputs[:, [2,3,4,5,6], :,:] # [32, 5, 256, 256]

# Approach I

l1 = loss1(o1, target1)
l2 = loss2(o2, target2)

if phase == 'train':
    l1.backward(retain_graph=True) 
    l2.backward()
    optimizer.step()

# Approach II

l1 = loss1(o1, target1)
l2 = loss2(o2, target2)

if phase == 'train':
    (l1+l2).backward() 
    optimizer.step()

Which of the approaches is more computationally efficient and helps the model learn better (that is, retain what was learnt in previous epochs)? In the case of the second approach, do the reported values of l1 and l2 still reflect their real individual losses in later epochs, given that they are summed before backward?

Hi Dhawgupta, I have a very similar question to yours; may I ask if you have found any solutions?