equivalent to setting the rest of the parameters’ requires_grad flag to False?
for i, param in enumerate(model.parameters()):
    if i != 0:
        param.requires_grad = False
optim = Adam(model.parameters())
What I can think of is that the second method prevents the gradients from being computed at all, while the first method merely prevents them from being applied. Is this correct, or are there other differences (or none at all)?
What you think is right. The first method only passes a subset of model.parameters() to the optimizer, so only those parameters get updated; gradients are still computed for the rest, they are just never applied. In the second method, the gradients of parameters with requires_grad=False are not computed during backpropagation at all, so those parameters will not be updated either.
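To make the contrast concrete, here is a minimal sketch of the first method. The two-layer nn.Sequential model is a made-up stand-in, not anyone's actual model: only the first layer's parameters are handed to the optimizer, so gradients are still computed everywhere, but only the chosen subset is updated.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model, just for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# First method: give the optimizer only the parameters that should train.
# requires_grad stays True everywhere, so backward() still computes
# gradients for every parameter, including the "frozen" second layer.
optim = torch.optim.Adam(model[0].parameters(), lr=0.1)

loss = model(torch.randn(3, 4)).sum()
loss.backward()
optim.step()

# The second layer received a gradient, but its values were never touched.
print(model[1].weight.grad is not None)  # True: gradient was computed anyway
```

This is exactly why the first method costs memory: the frozen layer's gradient buffers are allocated and filled on every backward pass even though they are never used.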
As you said, the second method prevents the gradients from being computed, so it saves some memory. I wrote a small demo to observe memory_allocated in each iteration, and it seems that the second method is the better solution.
The second method
import torch
import torch.nn as nn

# Net, random_input and random_target are defined elsewhere
model = Net().cuda()
for name, params in model.named_parameters():
    if name != 'conv1.weight':
        params.requires_grad = False
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
print("memory_allocated ", torch.cuda.memory_allocated())
print("max_memory_allocated ", torch.cuda.max_memory_allocated())
for i in range(5):
    print("Iteration ", i)
    optimizer.zero_grad()
    output = model(random_input)
    loss = criterion(output, random_target)
    loss.backward()
    optimizer.step()
    print("memory_allocated ", torch.cuda.memory_allocated())
    print("max_memory_allocated ", torch.cuda.max_memory_allocated())
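For completeness, here is a minimal CPU-only sketch (no CUDA or custom Net needed; the small nn.Sequential is a placeholder) showing where the memory saving comes from: with requires_grad=False, frozen parameters never receive a .grad tensor at all.

```python
import torch
import torch.nn as nn

# Placeholder model; only '0.weight' is left trainable.
net = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
for name, p in net.named_parameters():
    if name != '0.weight':
        p.requires_grad = False

net(torch.randn(3, 4)).sum().backward()

for name, p in net.named_parameters():
    # Only the trainable parameter has a .grad tensor; the rest stay None,
    # so no gradient buffers are ever allocated for them.
    print(name, p.grad is not None)
```

Note that passing the full model.parameters() to the optimizer is still fine here: PyTorch optimizers skip parameters whose .grad is None.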
For the same purpose (i.e. fixing a subset of parameters during training while still updating other parameters that may need gradients flowing through the fixed set), I found the first method much easier to implement than writing a register_hook function (e.g. as in Update only sub-elements of weights). Thanks for the great question and the great explanation here!
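For reference, here is a hedged sketch of the hook-based alternative mentioned above, for the case where only part of a single weight tensor should stay fixed (the layer and mask here are hypothetical): the hook multiplies the incoming gradient by a mask, so optimizer.step() leaves the masked entries untouched while gradients still flow to earlier layers.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
mask = torch.zeros_like(layer.weight)
mask[0] = 1.0  # hypothetical choice: only row 0 of the weight is trainable

# The hook receives the gradient during backward and returns a masked copy.
layer.weight.register_hook(lambda grad: grad * mask)

layer(torch.randn(2, 4)).sum().backward()
print(torch.all(layer.weight.grad[1:] == 0))  # rows 1+ get zero gradient
```

This is finer-grained than requires_grad=False (which only works on whole tensors), but the gradient is still computed in full before being masked, so it saves no memory.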