Could you give me some tips to train several decoders

Hello all,

I need to train 3 torch.nn.Module instances individually. Instead of using 3 separate torch.optim.Adam optimizers, I am passing all of their parameters to a single optimizer in this way:

models = torch.nn.ModuleList([net1, net2, net3])
optimizer = torch.optim.Adam(models.parameters(), lr=1e-3)

As I need to train 3 models, I need to use 3 inputs for training them, one for each model.

for epoch in range(100): # per epoch
    for features in dataloader: # features = [input0, input1, input2], one per model
        for id, input in enumerate(features):
            loss = models[id](input) # forward: each model ingests one input from features
            optimizer.zero_grad() # reset gradients
            loss.backward() # backward
            torch.nn.utils.clip_grad_norm_(models[id].parameters(), 1e0) # clipping
            optimizer.step() # adam step

Then, I tried to simplify the training loop in this way

for epoch in range(100):
    optimizer.zero_grad() # zero gradients once per epoch == 3 models do one step of training
    for features in dataloader:
        for id, input in enumerate(features):
            loss = models[id](input) # forward per model
            loss.backward() # backward per model (gradients accumulate)
    # Then, clip the 3 models and step the optimizer
    torch.nn.utils.clip_grad_norm_(models.parameters(), 1.)
    optimizer.step()

To my surprise, after testing the change over 3 trials, the second version always gives lower performance than the first one.

What am I missing here?

Thanks in advance

Hi @Miguel_Campos,

What I think might be happening is that when you move the optimizer.zero_grad() outside your dataloader for-loop, you’re accumulating gradients throughout your entire batch. So you first use the gradient of the first example, then add the gradient of the 2nd example to it and so on, then once you’ve used your entire batch you then update your parameters.

In the other case (first example), it seems you take a single example, compute its loss then update your parameters and repeat this for all samples in your batch.

So, it seems like the first example is a single-sample mini-batch gradient descent and the second example is full-batch gradient descent.

The reason why the second code performs better may be due to full-batch gradient descent having little noise within its updates, so it can easily minimise the loss, whereas single-sample stochastic gradient descent (SGD) has too much noise to update your model accurately. Usually for SGD, a sub-sample of the batch (aka a mini-batch) consisting of tens of samples is used to get a useful update while still keeping some noise, so as not to get stuck in a local minimum.
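To make the accumulation point concrete, here is a minimal sketch. The one-parameter "model" and the `loss_fn` helper are made up purely for illustration and are not the models from the question. It only shows that calling `backward()` twice without resetting the gradient sums the two gradients in `.grad`.

```python
import torch

# Hypothetical one-parameter "model", used only to illustrate accumulation.
w = torch.nn.Parameter(torch.tensor([1.0]))

def loss_fn(x):
    # d(loss)/dw equals x, which makes the gradients easy to read.
    return (w * x).sum()

# Two backward calls with no zero_grad() in between: gradients add up.
loss_fn(torch.tensor([2.0])).backward()
loss_fn(torch.tensor([3.0])).backward()
accumulated = w.grad.clone()  # holds 2 + 3

# After resetting the gradient, only the latest backward call is kept.
w.grad = None
loss_fn(torch.tensor([3.0])).backward()
fresh = w.grad.clone()  # holds 3

print(accumulated, fresh)
```

So if the gradients are reset only once per epoch, the single `optimizer.step()` at the end sees the sum of the gradients from every batch in that epoch.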

Hello @AlphaBetaGamma96.

Sorry, I fixed an important mistake in the second version (the position of optimizer.zero_grad()).

Could you take a look at it again and confirm your theory?

As I said, the first version is performing better than the second one.

The training loops are not equivalent, so a difference in performance would be expected. As @AlphaBetaGamma96 explained, you are accumulating the gradients for an entire epoch in the second approach, which is not the case in the first approach.
Also, calling optimizer.step() in the first approach will update the parameters of all models, even those which were not used in the current iteration, since Adam creates running stats for each param and will use them for an update even with a zero gradient.

Hello, it seems that you are saying that the second one is the correct way.

  • If yes, why could the first method be performing better?
  • How can it be possible that Adam updates the parameters of an unused model?
  • Should I create 3 different optimizers to train the models?
  • Maybe you could suggest the correct way to do this training.

@AlphaBetaGamma96 said “The reason why the second code performs better”, but it is the opposite case.


It seems that you are saying that the second one is the correct way.

No, as the first approach is the common training loop (besides iterating all samples in the nested features loop).

If yes, why could the first method be performing better?

Stochastic gradient descent empirically performs better than gradient descent (using the entire dataset).

How can it be possible that Adam updates the parameters of an unused model?

Once a parameter has been updated, Adam will create running stats and will use these to update the parameter even if the gradient is 0. Set the .grad attribute to None and Adam will skip the update.
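The behaviour described above can be checked with a small sketch. The two parameters `p_used` and `p_idle` are hypothetical stand-ins for a model that is used in the current iteration and one that is not. `zero_grad(set_to_none=False)` is used to force zero-valued gradients; as far as I know, recent PyTorch versions default to `set_to_none=True`, so the result may depend on the version in use.

```python
import torch

# Hypothetical stand-ins for a "used" and an "unused" model.
p_used = torch.nn.Parameter(torch.tensor([1.0]))
p_idle = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.Adam([p_used, p_idle], lr=0.1)

# Step 1: both parameters get a gradient, so Adam builds running stats for both.
(p_used.sum() + p_idle.sum()).backward()
opt.step()

# Step 2: p_idle keeps a gradient tensor filled with zeros.
opt.zero_grad(set_to_none=False)
p_used.sum().backward()
before = p_idle.detach().clone()
opt.step()
moved_with_zero_grad = not torch.equal(before, p_idle.detach())

# Step 3: p_idle.grad is None, so Adam skips this parameter.
opt.zero_grad(set_to_none=True)
p_used.sum().backward()
before = p_idle.detach().clone()
opt.step()
moved_with_none_grad = not torch.equal(before, p_idle.detach())

print(moved_with_zero_grad, moved_with_none_grad)
```

In step 2 the idle parameter still moves because Adam's first-moment estimate from step 1 has not decayed to zero, while in step 3 it is left untouched.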

Thank you for all the replies.

One last question: with 3 models, would you create 3 individual optimizers?

It depends on the use case and how each model should be updated. For example, if you want to update the models separately based on some conditions, I would create 3 separate optimizers and call them explicitly. On the other hand, if you are using a standard training loop where all models are used, receive gradients, and should be updated, you could also stick to one optimizer.
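A rough sketch of the separate-optimizer variant might look like the following. The `Linear` layers, the random inputs, the squared-output loss, and the "skip model 1" condition are all placeholders chosen for illustration; they are not taken from the original setup.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for net1, net2, net3.
models = torch.nn.ModuleList([torch.nn.Linear(4, 1) for _ in range(3)])
# One Adam optimizer per model, so each model can be stepped independently.
optimizers = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in models]

# One fake batch: one input per model, as in the question.
features = [torch.randn(8, 4) for _ in range(3)]
initial = [m.weight.detach().clone() for m in models]

for idx, (model, opt, x) in enumerate(zip(models, optimizers, features)):
    if idx == 1:
        continue  # placeholder condition: model 1 is not trained this iteration
    opt.zero_grad()
    loss = model(x).pow(2).mean()  # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()  # only touches this model's parameters and Adam state

changed = [not torch.equal(w0, m.weight.detach()) for w0, m in zip(initial, models)]
print(changed)
```

With this layout, a model that is skipped in an iteration is never touched by another model's `step()` call, regardless of how its gradients were reset.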