equivalent to setting the rest of the parameters’ requires_grad flag to False?
for i, param in enumerate(model.parameters()):
    if i != 0:
        param.requires_grad = False
optim = Adam(model.parameters())
What I can think of is that the second method prevents the gradients from being computed at all, while the first method merely prevents them from being applied. Is this correct, or are there other differences (or none at all)?
What you think is right. The first method only passes a subset of model.parameters() to the optimizer, so only those parameters get updated; gradients are still computed for the rest, they are just never applied. In the second method, the gradients of parameters with requires_grad=False are not computed during backpropagation at all, so those parameters will not be updated either.
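To make the contrast concrete, here is a minimal sketch of the first method. The two-layer nn.Sequential model is a made-up stand-in, not anyone's actual model: only the first layer's parameters are handed to the optimizer, so gradients are still computed everywhere, but only the chosen subset is updated.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model, just for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# First method: give the optimizer only the parameters that should train.
# requires_grad stays True everywhere, so backward() still computes
# gradients for every parameter, including the "frozen" second layer.
optim = torch.optim.Adam(model[0].parameters(), lr=0.1)

loss = model(torch.randn(3, 4)).sum()
loss.backward()
optim.step()

# The second layer received a gradient, but its values were never touched.
print(model[1].weight.grad is not None)  # True: gradient was computed anyway
```

This is exactly why the first method costs memory: the frozen layer's gradient buffers are allocated and filled on every backward pass even though they are never used.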
As you said, the second method prevents the gradients from being computed, so it saves some memory. I wrote a small demo to observe memory_allocated in each iteration, and it seems that the second method is the better solution.
The second method
import torch
import torch.nn as nn

# Net, random_input and random_target are defined elsewhere
model = Net().cuda()
for name, params in model.named_parameters():
    if name != 'conv1.weight':
        params.requires_grad = False
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
print("memory_allocated ", torch.cuda.memory_allocated())
print("max_memory_allocated ", torch.cuda.max_memory_allocated())
for i in range(5):
    print("Iteration ", i)
    optimizer.zero_grad()
    output = model(random_input)
    loss = criterion(output, random_target)
    loss.backward()
    optimizer.step()
    print("memory_allocated ", torch.cuda.memory_allocated())
    print("max_memory_allocated ", torch.cuda.max_memory_allocated())
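For completeness, here is a minimal CPU-only sketch (no CUDA or custom Net needed; the small nn.Sequential is a placeholder) showing where the memory saving comes from: with requires_grad=False, frozen parameters never receive a .grad tensor at all.

```python
import torch
import torch.nn as nn

# Placeholder model; only '0.weight' is left trainable.
net = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
for name, p in net.named_parameters():
    if name != '0.weight':
        p.requires_grad = False

net(torch.randn(3, 4)).sum().backward()

for name, p in net.named_parameters():
    # Only the trainable parameter has a .grad tensor; the rest stay None,
    # so no gradient buffers are ever allocated for them.
    print(name, p.grad is not None)
```

Note that passing the full model.parameters() to the optimizer is still fine here: PyTorch optimizers skip parameters whose .grad is None.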
For the same purpose (i.e. fixing a subset of parameters during training while still updating other parameters that may need gradients flowing through the fixed set), I found the first method much easier to implement than writing a register_hook function (e.g. as in Update only sub-elements of weights). Thanks for the great question and the great explanation here!
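For reference, here is a hedged sketch of the hook-based alternative mentioned above, for the case where only part of a single weight tensor should stay fixed (the layer and mask here are hypothetical): the hook multiplies the incoming gradient by a mask, so optimizer.step() leaves the masked entries untouched while gradients still flow to earlier layers.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
mask = torch.zeros_like(layer.weight)
mask[0] = 1.0  # hypothetical choice: only row 0 of the weight is trainable

# The hook receives the gradient during backward and returns a masked copy.
layer.weight.register_hook(lambda grad: grad * mask)

layer(torch.randn(2, 4)).sum().backward()
print(torch.all(layer.weight.grad[1:] == 0))  # rows 1+ get zero gradient
```

This is finer-grained than requires_grad=False (which only works on whole tensors), but the gradient is still computed in full before being masked, so it saves no memory.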