Gradients and the way they are computed

I have three neural networks: model1, model2, and model3. I feed an input to model1, its output is fed to model2, and model2's output is fed to model3. I have two loss functions: one for model2 alone, and a global loss function that optimizes only the weights of model1 and model3. model2 is optimized only for a few epochs, after which it is run in forward propagation only.
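Roughly, the forward pass looks like this (variable names are only for illustration):

out1 = model1(x)            # input -> model1
out2 = model2(out1)         # model1's output -> model2
obtained = model3(out2)     # model2's output -> model3, compared against expected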

I have defined the optimizers as follows:

global_opt = optim.Adam(list(model1.parameters()) + list(model3.parameters()))
model2_opt = optim.Adam(model2.parameters())

For the first few epochs I compute the following:

criterion = nn.MSELoss()
global_loss = criterion(obtained, expected)
global_loss.backward()
global_opt.step()

model2_crit = nn.BCELoss()
model2_loss = model2_crit(a, b)
model2_loss.backward()
model2_opt.step()

While executing these, all networks are set to training mode using model1.train(), model2.train(), and model3.train().

After the first few epochs, I want to stop training model2. I can’t set model2.eval() because model2 contains RNNs and the gradients still have to flow through model2 back to model1.

So in the training loop I only have:

criterion = nn.MSELoss()
global_loss = criterion(obtained, expected)
global_loss.backward()
global_opt.step()

My question is (it might be silly): does model2 still update its weights after its training has stopped, i.e. when the last piece of code is the only code running in the training loop? Thank you for reading this question.

model2's parameters won’t be updated unless you call model2_opt.step().
However, if your model contains batchnorm layers, their running estimates will still be updated during the forward pass as long as you keep these layers in .train() mode.
Dropout will also still be applied, but of course there is nothing to update there.
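If you want to avoid updating the running estimates, one option is to put just these normalization (and dropout) layers into eval mode while keeping the rest of model2 in train() mode. A minimal sketch, assuming model2 is a standard nn.Module as in your snippets:

import torch.nn as nn

# Put only the batchnorm/dropout submodules of model2 into eval mode,
# so the running estimates stop updating and dropout is disabled.
# The rest of model2 (e.g. the RNN layers) stays in train mode.
for m in model2.modules():
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d, nn.Dropout)):
        m.eval()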

That being said, the .grad attributes of model2's parameters will still accumulate gradients if you don’t set their requires_grad attribute to False.
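For example (a minimal sketch), freezing the parameters this way still lets gradients flow through model2's activations back to model1:

# Freeze model2's parameters: autograd will no longer populate .grad for them.
# Gradients w.r.t. model2's inputs are still computed, so model1 still receives gradients.
for p in model2.parameters():
    p.requires_grad_(False)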

Thank you for your response. I am calling model2_opt.zero_grad() in each iteration of the training loop. Won’t this suffice to prevent the gradients from accumulating?

Yes, this should be sufficient. 🙂
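For reference, with that call at the start of each iteration, the later phase of the loop would look roughly like this (a minimal sketch using the names from your snippets; x and expected stand in for your actual data):

global_opt.zero_grad()
model2_opt.zero_grad()                  # prevents model2's .grad from accumulating across iterations

obtained = model3(model2(model1(x)))    # model2 runs forward only
global_loss = criterion(obtained, expected)
global_loss.backward()
global_opt.step()                       # model2_opt.step() is never called in this phase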
