Correct parameter update


I have a few, probably simple, questions about how parameter updates work during training and what the best practice is for my problem. My network has two subnets whose outputs are concatenated and fed into a third network. Pretty simple so far.

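import torch
import torch.nn as nn

class SubNetA(nn.Module):
    def __init__(self):
        super().__init__()
        # layers omitted

class SubNetB(nn.Module):
    def __init__(self):
        super().__init__()
        # layers omitted

class SuperNet(nn.Module):
    def __init__(self):
        super().__init__()
        # layers omitted

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.subnetA  = SubNetA()
        self.subnetB  = SubNetB()
        self.supernet = SuperNet()

    def forward(self, x):
        yA   = self.subnetA(x)
        yB   = self.subnetB(x)
        # concatenate both subnet outputs along the feature dimension
        yCat = torch.cat((yA, yB), dim=1)
        return self.supernet(yCat), yA, yB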

My idea right now is to train subnetA and subnetB simultaneously on the same data in one training session, since they are completely independent of each other. What is the best practice for updating only the weights of those subnets?

So far, I would do this (it’s not real code, just to show what I mean):

models    = MyNet()
optimizer = torch.optim.SGD(models.parameters(), ...)

# Training of subnetA and subnetB only
models.supernet.requires_grad_(False)
for iBatch, iTarget in data_loader:
    optimizer.zero_grad()
    _, yA, yB = models(iBatch)
    lossA = LossFunction(yA, iTarget)
    lossB = LossFunction(yB, iTarget)
    lossA.backward()
    lossB.backward()
    optimizer.step()

# Training of supernet only with already trained subnetA and subnetB
models.subnetA.requires_grad_(False)
models.subnetB.requires_grad_(False)
for iBatch, iTarget in data_loader:
    optimizer.zero_grad()
    y, _, _ = models(iBatch)
    loss = LossFunction(y, iTarget)
    loss.backward()
    optimizer.step()

Does this approach look okay so far, or can it lead to problems? Then a few additional questions, just to make sure I understand autograd correctly:

  1. Is it enough to initialize the optimizer with all parameters of the whole model and just set requires_grad to False in order to train only one subpart of the network?
  2. If I call loss.backward(), are only the weights of the layers before the loss affected, or does it also have an effect on the layers after it? For instance, if I only calculate the loss of subnetA and do not set requires_grad to False anywhere, are only the weights of subnetA updated?
  3. Can I combine the loss values so that I only call .backward() once? Like loss_ = lossA + lossB → loss_.backward()

Probably some easy questions, but thanks in advance for any good hints.

Your pseudocode looks okay in general. But remember to call supernet.requires_grad_(True) before you train the supernet, because its gradients were turned off in the first stage. Similarly, call subnetA.requires_grad_(True) and subnetB.requires_grad_(True) before you train the subnets again.
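As a rough illustration of that two-stage schedule, here is a minimal sketch. The `nn.Linear` layers, the random data, the MSE loss, and the helper `snapshot` are all assumptions made for the example; they stand in for the real networks and data loader, which the thread does not show.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical stand-ins for the real sub-networks.
        self.subnetA = nn.Linear(4, 2)
        self.subnetB = nn.Linear(4, 2)
        self.supernet = nn.Linear(4, 2)

    def forward(self, x):
        yA = self.subnetA(x)
        yB = self.subnetB(x)
        return self.supernet(torch.cat((yA, yB), dim=1)), yA, yB

def snapshot(module):
    # Copy the current parameter values so they can be compared later.
    return [p.detach().clone() for p in module.parameters()]

model = MyNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
x, target = torch.randn(8, 4), torch.randn(8, 2)  # assumed toy batch

# Stage 1: train the subnets, keep the supernet frozen.
model.supernet.requires_grad_(False)
model.subnetA.requires_grad_(True)
model.subnetB.requires_grad_(True)
super_before = snapshot(model.supernet)
for _ in range(3):
    optimizer.zero_grad()
    _, yA, yB = model(x)
    loss_fn(yA, target).backward()
    loss_fn(yB, target).backward()
    optimizer.step()
supernet_unchanged = all(
    torch.equal(a, b) for a, b in zip(super_before, model.supernet.parameters())
)

# Stage 2: re-enable the supernet, freeze the subnets.
model.supernet.requires_grad_(True)
model.subnetA.requires_grad_(False)
model.subnetB.requires_grad_(False)
sub_before = snapshot(model.subnetA) + snapshot(model.subnetB)
sub_params = list(model.subnetA.parameters()) + list(model.subnetB.parameters())
for _ in range(3):
    optimizer.zero_grad()
    y, _, _ = model(x)
    loss_fn(y, target).backward()
    optimizer.step()
subnets_unchanged = all(torch.equal(a, b) for a, b in zip(sub_before, sub_params))

print(supernet_unchanged, subnets_unchanged)
```

Setting the flags explicitly at the start of each stage, rather than only switching some of them off, keeps the stages independent of the order in which they are run.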

Regarding your bullet points,

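  1. Yes, it’s enough to call requires_grad_(False) to disable updates for that part, since the optimizer updates each parameter based on its gradient and skips parameters whose .grad is None.
  2. Yes, if the loss is based only on subnetA, backward() computes gradients only for subnetA. The other nets receive no gradient, so optimizer.step() leaves them unchanged (provided their gradients were cleared beforehand).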
  3. I’m not sure what you’re asking. Do you mean merging training of subnets and supernet?
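To make points 1 and 2 concrete, the sketch below inspects `.grad` after a backward pass on a loss that depends on one sub-network only, with nothing frozen. The layer sizes, the random data, and the MSE loss are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the three networks in the question.
subnetA = nn.Linear(4, 2)
subnetB = nn.Linear(4, 2)
supernet = nn.Linear(4, 2)

params = (
    list(subnetA.parameters())
    + list(subnetB.parameters())
    + list(supernet.parameters())
)
optimizer = torch.optim.SGD(params, lr=0.1)

x, target = torch.randn(8, 4), torch.randn(8, 2)  # assumed toy batch
yA, yB = subnetA(x), subnetB(x)
y = supernet(torch.cat((yA, yB), dim=1))

# The loss depends on subnetA only.
lossA = nn.functional.mse_loss(yA, target)
lossA.backward()

a_has_grad = all(p.grad is not None for p in subnetA.parameters())
b_has_no_grad = all(p.grad is None for p in subnetB.parameters())
super_has_no_grad = all(p.grad is None for p in supernet.parameters())

# The optimizer skips every parameter whose .grad is None.
a_before = subnetA.weight.detach().clone()
b_before = subnetB.weight.detach().clone()
optimizer.step()
a_changed = not torch.equal(a_before, subnetA.weight)
b_unchanged = torch.equal(b_before, subnetB.weight)

print(a_has_grad, b_has_no_grad, super_has_no_grad, a_changed, b_unchanged)
```

On a freshly created model the untouched gradients are `None` rather than zero tensors; they only hold zeros if an earlier `optimizer.zero_grad(set_to_none=False)` filled them in.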


Sorry for the late response, and thanks a lot for your explanations so far. What I meant with the third point was whether the following two cases do the same thing in the end:


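# Case 1: one backward call per loss
y1, y2, y3   = model(x)
loss1        = Loss(y1, gt)
loss2        = Loss(y2, gt)
loss3        = Loss(y3, gt)
loss1.backward(retain_graph=True)  # y1 shares its graph with y2 and y3, so keep it
loss2.backward(retain_graph=True)
loss3.backward()

# Case 2: a single backward call on the summed loss
y1, y2, y3   = model(x)
loss1        = Loss(y1, gt)
loss2        = Loss(y2, gt)
loss3        = Loss(y3, gt)
loss_all     = loss1 + loss2 + loss3
loss_all.backward()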
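One way to check whether the two cases agree is to run both on identical copies of a model and compare the accumulated gradients. The sketch below does that on a hypothetical `ToyNet` with three outputs; the model, the random data, and the MSE loss are assumptions, not the original network.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical stand-ins for the real sub-networks.
        self.subnetA = nn.Linear(4, 2)
        self.subnetB = nn.Linear(4, 2)
        self.supernet = nn.Linear(4, 2)

    def forward(self, x):
        yA = self.subnetA(x)
        yB = self.subnetB(x)
        return self.supernet(torch.cat((yA, yB), dim=1)), yA, yB

model_sep = ToyNet()
model_sum = copy.deepcopy(model_sep)  # identical weights for a fair comparison
loss_fn = nn.MSELoss()
x, gt = torch.randn(8, 4), torch.randn(8, 2)  # assumed toy batch

# Case 1: separate backward calls; gradients accumulate in .grad.
y1, y2, y3 = model_sep(x)
loss_fn(y1, gt).backward(retain_graph=True)  # the outputs share one graph
loss_fn(y2, gt).backward(retain_graph=True)
loss_fn(y3, gt).backward()

# Case 2: one backward call on the summed loss.
y1, y2, y3 = model_sum(x)
(loss_fn(y1, gt) + loss_fn(y2, gt) + loss_fn(y3, gt)).backward()

same_grads = all(
    torch.allclose(p.grad, q.grad, atol=1e-6)
    for p, q in zip(model_sep.parameters(), model_sum.parameters())
)
print(same_grads)
```

Because `.grad` accumulates across backward calls, both variants should yield the same gradients up to floating-point rounding. The summed version needs only one pass through the graph and no `retain_graph=True`.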