I am new to PyTorch and am trying to build a neural network that has two sequential networks trained and evaluated in this particular sequence:
Part 1: the ‘latter’ part of the network is trained in isolation.
Part 2: the ‘former’ part of the network is trained with its output fed to the ‘latter’ part of the network (held constant), and the output of the ‘latter’ part is used for the loss.
Part 3: data is evaluated with the ‘former’ part of the network only.
My question is: am I right that gradients still need to flow through the ‘latter’ part in part 2? I would like the functions of the ‘latter’ part of the network to operate on the output of the ‘former’ part, but NOT update the ‘latter’ part’s parameters in the process.
Based on the example below, will PyTorch “know” that ‘latter’ operated on the loss, and include it in the computation graph and pass its gradients backwards? Assume the latter is already trained in the example below.
Any insight is highly appreciated.
inp_var, out_var = Variable(field), Variable(lens)  # batch from torch.utils.data.DataLoader
optimizer.zero_grad()              # reset gradients for the new batch
output = former_model(inp_var)     # forward pass of former model (to be trained)
output = latter_net(output)        # latter_net is pre-trained; former's output fed to latter
loss = criterion(output, inp_var)  # loss function, comparing against the INPUT to the former
loss.backward()                    # backward pass
optimizer.step()                   # update the 'former' parameters
Yes, PyTorch will “know” that latter_net operated on the loss; it will include it in the computation graph and automatically backpropagate gradients through it.
If you do not want latter_net’s parameters to be updated, set param.requires_grad = False before using latter_net, like below:
for param in latter_net.parameters():
    param.requires_grad = False
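Putting the two ideas together, here is a minimal, self-contained sketch of part 2 — the module shapes, data, and loss are toy stand-ins for your former_model/latter_net setup, not your actual code:

```python
import torch
import torch.nn as nn

former_model = nn.Linear(4, 8)   # part to be trained
latter_net = nn.Linear(8, 4)     # pre-trained part, to be held constant

for param in latter_net.parameters():   # freeze latter_net
    param.requires_grad = False

# the optimizer only ever sees the former part's parameters
optimizer = torch.optim.SGD(former_model.parameters(), lr=0.1)
criterion = nn.MSELoss()

inp_var = torch.randn(16, 4)     # toy batch standing in for the real input

before = [p.clone() for p in latter_net.parameters()]

optimizer.zero_grad()
output = latter_net(former_model(inp_var))
loss = criterion(output, inp_var)   # compare against the INPUT, as in the question
loss.backward()                     # gradients flow THROUGH latter_net...
optimizer.step()

# ...but latter_net's weights are untouched, and its .grad tensors were never built
assert all(torch.equal(b, p) for b, p in zip(before, latter_net.parameters()))
assert all(p.grad is None for p in latter_net.parameters())
assert all(p.grad is not None for p in former_model.parameters())
```

The assertions at the end show the two properties you want: the frozen part still carries the gradient signal back to the trainable part, while its own parameters neither accumulate gradients nor change.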
Thank you. I had set up two separate optimizers for the two sub-networks in a similar fashion, and as a check I saved the ‘latter’ model (to a different file) after optimizing the ‘former’ with the ‘latter’ held constant — it had not changed, as I desired.
Do you think setting requires_grad = False is not necessary then?
Interesting - I read into the PyTorch documentation and it mentions the same remedy for freezing parts of the network.
So to get this straight: even though requires_grad=False is set, it will still compute the gradients for that part of the network, but simply not update the weights? The name requires_grad seems a bit misleading, as the gradients are indeed calculated… ?
It’s necessary. If you do not set requires_grad = False and instead just construct the optimizer with former_model.parameters() only, then latter_net’s parameters will not be updated,
but their gradients are still computed and stored on every backward(), which takes up GPU memory for nothing.
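To make that concrete, a small sketch (toy modules, not the original code): if latter_net keeps requires_grad=True and the optimizer is built only from former_model.parameters(), backward() still allocates and fills .grad tensors on latter_net, even though optimizer.step() would never use them:

```python
import torch
import torch.nn as nn

former_model, latter_net = nn.Linear(4, 8), nn.Linear(8, 4)

# latter_net deliberately excluded from the optimizer, but NOT frozen
optimizer = torch.optim.SGD(former_model.parameters(), lr=0.1)

x = torch.randn(16, 4)
loss = nn.MSELoss()(latter_net(former_model(x)), x)
loss.backward()

# latter_net is never updated by optimizer.step(), yet its gradients
# were computed and stored anyway — this is the wasted memory
assert all(p.grad is not None for p in latter_net.parameters())
```

So conceptually both routes leave the frozen weights unchanged; requires_grad = False additionally tells autograd it can skip building those gradients at all.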
Thank you for the reply. Assuming memory is not a problem, I would still achieve the same result, though. I’m more concerned about my conceptual understanding at this point, but since setting requires_grad = False is invariably recommended, I will of course do so to reduce memory consumption.