Two-branch architecture, back-propagating the loss of just one branch

Hi Dears,

I created the following architecture, which has one main branch but splits into two branches at the decision layer, where the resnet FC layer is shared:

class Net1(nn.Module):

    def __init__(self, in_features, out_features):
        super(Net1, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.Tensor(out_features, in_features))

    def forward(self, input, label):
        out = torch.mm(input, self.weight.t())
        return out

classifier = Net1(1024, 624)

class MyRes(nn.Module):
    def __init__(self, in_features = 1024, Num_classes = 624):
        super(MyRes, self).__init__()
        self.in_features = in_features
        self.res_model =  torchvision.models.resnet50(pretrained=True) 
        self.res_model.fc = nn.Sequential(
            nn.Linear(2048, in_features),
        )
        self.classifier = classifier
        self.res_model.classifier = nn.Linear(in_features, Num_classes)
    def forward(self, x, labels):        
        x = self.res_model(x)
        l = self.classifier(x, labels)            
        x = self.res_model.classifier(x)
        return x, l

In training, I calculate the loss for just the first branch and then back-propagate it:

model =  MyRes()
optimizer = optim.Adam(model.parameters(), lr = lr)

pred, preds2 = model(images)

loss = criterion(preds, labels)        

I am wondering what happens in the main branch, especially at the point where the two branches connect.
Will there be two gradients (from self.classifier and self.res_model.classifier), and are they summed in back-propagation?
What happens to the weights (self.weight) in Net1: will they be learned during training, or stay fixed at their initial values?

Best Regards

Your current architecture creates two new classifiers and assigns one to self.classifier and the other one to a new attribute called self.res_model.classifier. Note that you are not replacing the last linear layer of the resnet in the latter case, as it’s assigned to self.res_model.fc (and already replaced with the nn.Sequential module). This isn’t necessarily wrong, but you should be aware that the forward pass of self.res_model will not call self.res_model.classifier.
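A minimal sketch of this point, using a small stand-in module instead of resnet50 (the Backbone class and the layer sizes are hypothetical, chosen only for illustration). Assigning a new attribute registers its parameters, but the parent's forward never calls it:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    # Hypothetical stand-in for the resnet: forward() only ever calls self.fc.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        return self.fc(x)

backbone = Backbone()
backbone.fc = nn.Sequential(nn.Linear(8, 6))  # replaces the layer used in forward()
backbone.classifier = nn.Linear(6, 3)         # new attribute, never called by forward()

out = backbone(torch.randn(2, 8))
param_names = [name for name, _ in backbone.named_parameters()]
print(out.shape)     # shape produced by the replaced fc, not by the classifier
print(param_names)   # the classifier parameters are registered anyway
```

In the posted MyRes, the extra layer is applied explicitly in MyRes.forward, which is why it still takes part in the computation there.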

The line self.weight = Parameter(torch.Tensor(out_features, in_features)) will use uninitialized memory, which is most likely not what you want.
Don’t use the torch.Tensor constructor, but the factory methods instead (torch.randn, torch.ones, etc.).
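As a rough sketch of that suggestion, Net1 could create its weight with a factory method. The 0.01 scale and the plain matrix product in forward are assumptions made for this example, not taken from the original code:

```python
import torch
import torch.nn as nn

class Net1(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # torch.randn returns well-defined values; the 0.01 scale is an arbitrary choice here.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)

    def forward(self, input):
        # Plain matrix product, used as a stand-in for the original forward computation.
        return torch.mm(input, self.weight.t())

net = Net1(1024, 624)
out = net(torch.randn(2, 1024))
print(out.shape)
```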

Regarding the two gradients: no, since you are calculating the loss from one output only (preds in the posted code snippet, which is undefined, so I don’t know which output you are using), only that computation path contributes gradients. If you are calculating the loss from both outputs, both computation paths will be used to calculate the gradients of the parameters in the main branch.
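This can be checked on a toy model. The sketch below uses small nn.Linear layers as hypothetical stand-ins for the main branch and the two heads, and a simple sum as the loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
trunk = nn.Linear(4, 3)    # shared main branch
head_a = nn.Linear(3, 2)   # branch used for the loss
head_b = nn.Linear(3, 2)   # second branch
x = torch.randn(5, 4)

feat = trunk(x)
loss_a = head_a(feat).sum()
loss_b = head_b(feat).sum()

# Loss from head_a only: head_b receives no gradient at all.
loss_a.backward(retain_graph=True)
grad_a_only = trunk.weight.grad.clone()
head_b_grad_after_a = head_b.weight.grad

# Gradient of the trunk from head_b alone.
trunk.zero_grad()
loss_b.backward(retain_graph=True)
grad_b_only = trunk.weight.grad.clone()

# Loss from both heads: the trunk gradient is the sum of the two paths.
trunk.zero_grad()
(loss_a + loss_b).backward()
grad_both = trunk.weight.grad.clone()

print(head_b_grad_after_a)
print(torch.allclose(grad_both, grad_a_only + grad_b_only, atol=1e-6))
```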

Regarding self.weight in Net1: if you are using this module’s output to calculate the loss (again unclear based on your posted code), then the weight will be updated unless you freeze the parameters by setting their .requires_grad attribute to False.
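A small sketch of freezing a parameter this way, with a hypothetical nn.Linear layer standing in for the branch. Passing only the trainable parameters to the optimizer is a common convention, not something the posted code does:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)
layer.weight.requires_grad = False          # freeze the weight; the bias stays trainable
before = layer.weight.detach().clone()

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.1)

loss = layer(torch.randn(4, 3)).sum()
loss.backward()
optimizer.step()

print(layer.weight.grad)                    # the frozen weight never gets a gradient
print(torch.equal(layer.weight, before))    # and its values are unchanged
```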

Dear ptrblck,

Thanks for your reply,

Actually, you are right: the shared point is the output of the nn.Sequential module that replaces self.res_model.fc (which has 1024 features).

The idea here is that I am using CosFace (self.classifier) alongside another FC layer for classification.

In more detail, in the training phase:

  1. Forward pass: both branches are used, self.classifier (CosFace) and self.res_model.classifier (the classification output), but the loss is calculated from the classification branch only.
  2. Back-propagation: the loss calculated from the classification output is back-propagated (CosFace is not used).

In the testing phase, only the classification branch is active (CosFace is not used at all).

My main question: since I am not using the CosFace branch to calculate the loss and back-propagate it, how does it influence my training? (Recall that the results are very different from when I calculate the loss from CosFace and back-propagate it, and also from when I don't use CosFace at all.)

Best Regards

It shouldn’t influence the training at all, if you are just calling its forward without using any output.
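One way to check this on a toy example: compute the main-branch gradient with and without the extra forward call and compare. The layers below are hypothetical stand-ins, and the sum is used as a placeholder loss:

```python
import torch
import torch.nn as nn

def trunk_grad(call_unused_branch):
    torch.manual_seed(0)                  # same initialization and data in both runs
    trunk = nn.Linear(4, 3)               # shared main branch
    used_head = nn.Linear(3, 2)           # branch that feeds the loss
    unused_head = nn.Linear(3, 2)         # branch whose output is ignored
    x = torch.randn(5, 4)

    feat = trunk(x)
    if call_unused_branch:
        _ = unused_head(feat)             # forward only; the output is discarded
    used_head(feat).sum().backward()
    return trunk.weight.grad.clone(), unused_head.weight.grad

grad_with, unused_grad = trunk_grad(True)
grad_without, _ = trunk_grad(False)

print(torch.allclose(grad_with, grad_without))
print(unused_grad)
```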

Thanks a lot,

I found that adding FC layers to the resnet improved my results a lot.
So CosFace was not used at all.

Best Regards