I was going through the following information on reducing learning rates in PyTorch to a really low value like 1e-9.

I am puzzled as to why doing loss = loss / 100 is equivalent to reducing the learning rate by a factor of 100. The full snippet is below.

outputs = model(batch)
loss = criterion(outputs, targets)
# Equivalent to lowering the learning rate by a factor of 100
loss = loss / 100
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()

If you scale the loss, you’ll also scale the gradients.
In a simple use case, this can be used instead of changing the learning rate, as seen here:

# Setup
import torch
import torch.nn as nn

torch.manual_seed(2809)
lin = nn.Linear(2, 2, bias=False)
x = torch.randn(1, 2)
# Standard approach
out = lin(x)
loss = out.sum()
print(loss)
loss.backward()
print(lin.weight.grad)
> tensor(-0.8130, grad_fn=<SumBackward0>)
tensor([[-1.1281,  0.8386],
        [-1.1281,  0.8386]])
# loss scaling by x10
lin.zero_grad()
out = lin(x)
loss = out.sum() * 10
print(loss)
loss.backward()
print(lin.weight.grad)
> tensor(-8.1301, grad_fn=<MulBackward0>)
tensor([[-11.2812,  8.3855],
        [-11.2812,  8.3855]])
# loss scaling by x0.1
lin.zero_grad()
out = lin(x)
loss = out.sum() * 0.1
print(loss)
loss.backward()
print(lin.weight.grad)
> tensor(-0.0813, grad_fn=<MulBackward0>)
tensor([[-0.1128,  0.0839],
        [-0.1128,  0.0839]])
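
To double-check the claimed equivalence for plain SGD, here is a minimal sketch of my own (not from the post above; the layer names and lr values are just illustrative): scaling the loss by 1/100 while keeping the original learning rate ends up with the same weights as leaving the loss alone and dividing the learning rate by 100.

# Minimal sketch (assumption: plain SGD, single step) comparing loss scaling
# against learning-rate scaling.
import copy
import torch
import torch.nn as nn

torch.manual_seed(2809)
lin_a = nn.Linear(2, 2, bias=False)
lin_b = copy.deepcopy(lin_a)          # identical initial weights
x = torch.randn(1, 2)

# A: scale the loss by 1/100, keep lr = 1e-2
opt_a = torch.optim.SGD(lin_a.parameters(), lr=1e-2)
opt_a.zero_grad()
(lin_a(x).sum() / 100).backward()
opt_a.step()

# B: keep the loss, scale the learning rate down by 100
opt_b = torch.optim.SGD(lin_b.parameters(), lr=1e-2 / 100)
opt_b.zero_grad()
lin_b(x).sum().backward()
opt_b.step()

print(torch.allclose(lin_a.weight, lin_b.weight))
> True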

However, I would be careful with more advanced optimizers that, e.g., track a running average of the gradients or adjust the learning rate internally in some way.
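
As an illustration of why (a minimal sketch of my own, using Adam as an example of such an optimizer): Adam divides each gradient by a running estimate of its magnitude, so scaling the loss largely cancels out in the update, and the two approaches no longer produce the same weights.

# Minimal sketch (assumption: Adam, single step): loss scaling vs. lr scaling
# are no longer equivalent, because Adam normalizes by the gradient magnitude.
import copy
import torch
import torch.nn as nn

torch.manual_seed(2809)
lin_a = nn.Linear(2, 2, bias=False)
lin_b = copy.deepcopy(lin_a)
x = torch.randn(1, 2)

# A: scaled loss, original lr
opt_a = torch.optim.Adam(lin_a.parameters(), lr=1e-3)
opt_a.zero_grad()
(lin_a(x).sum() / 100).backward()
opt_a.step()

# B: original loss, lr scaled down by 100
opt_b = torch.optim.Adam(lin_b.parameters(), lr=1e-3 / 100)
opt_b.zero_grad()
lin_b(x).sum().backward()
opt_b.step()

# The updates differ: Adam's per-parameter normalization absorbs most of the
# loss scaling, while the smaller lr directly shrinks the step.
print(torch.allclose(lin_a.weight, lin_b.weight))
> False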