I am running into a strange problem where the loss multiplied by 0.0 still has an effect on my model during training.
I do not want to have a lot of if/else statements in my code, which is why I decided to use multiplying by 0.0 as a switch.
Multiplying the loss with 0.0 will create zero gradients. However, your model could still “change”, e.g. since running stats are updated in each forward pass in batchnorm layers during training (i.e. if the model is in model.train() mode). Also, the optimizer could still update parameters if it’s using running stats (e.g. Adam) and if these stats were already populated by earlier updates, even if the current gradient is set to zero.
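For the batchnorm part, here is a minimal sketch (a toy layer, not your model) showing that the running stats change even though the loss is scaled by 0.0:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
bn.train()  # training mode updates the running stats in the forward pass
print(bn.running_mean)  # tensor([0., 0., 0., 0.])

x = torch.randn(8, 4)
loss = bn(x).sum() * 0.0  # loss is scaled to zero, so all gradients are zero
loss.backward()

print(bn.running_mean)  # no longer all zeros: updated by the forward pass itself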
I have not used BatchNorm in my model, but I do indeed use Adam. However, based on the equation above, it seems like the optimizer only stores running stats of the gradients, which for loss2 are always 0 (loss = loss1 + loss2, where loss2 is always multiplied by 0). Why would it still affect the training dynamics? Am I understanding Adam wrong?
The running stats are computed and updated in each step() and kept for the next update. Even if all gradients in the current step() are 0.0, the optimizer will still update the parameters, as seen here:
import torch
import torch.nn as nn

model = nn.Linear(1, 1, bias=False)
optimizer = torch.optim.Adam(model.parameters(), lr=1.)
x = torch.randn(1, 1)
out = model(x)
out.mean().backward()
print(model.weight)
# tensor([[-0.7789]], requires_grad=True)
optimizer.step()
print(model.weight)
# tensor([[-1.7789]], requires_grad=True)
optimizer.zero_grad()
print(model.weight.grad)
# tensor([[0.]])
# second step: the gradient is 0.0, but Adam still updates the weight
optimizer.step()
print(model.weight)
# tensor([[-2.4490]], requires_grad=True)
# delete gradients
optimizer.zero_grad(set_to_none=True)
print(model.weight.grad)
# None
# third step: the gradient is None, so the parameter is skipped
optimizer.step()
print(model.weight)
# tensor([[-2.4490]], requires_grad=True)
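This is exactly what Adam’s update rule predicts (sketched here in standard notation; bias correction shown, weight decay omitted):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\theta_t = \theta_{t-1} - \mathrm{lr} \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon), \quad \hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)

With g_t = 0 the moments m_t and v_t only decay towards zero but stay non-zero, so the update term \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) is still non-zero and the weight keeps moving, as in the second step() above. Only a gradient of None makes the optimizer skip the parameter, as in the last step().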
You would have to delete the gradients (e.g. via optimizer.zero_grad(set_to_none=True)) so that Adam ignores these parameters.
Are you using dropout? Its mask is randomly sampled during each batch.
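For example (a toy illustration of that randomness, not your model):

import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

# In train mode a new dropout mask is sampled on every forward pass,
# so the same input gives different outputs:
print(drop(x))
print(drop(x))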
I have drop path, not dropout. Will drop path affect it? How so?
This example is different from my case. My case has two losses, where the second loss is always multiplied by 0; it is not that all gradients in a subsequent iteration are 0. So my question is: will this second loss, when multiplied by 0, still affect the training, or is it the same as not adding it at all, in the case of Adam?
No, the second loss which is multiplied by 0 won’t have any effect on the first loss and the optimizer.
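To double check this you could compare the gradients directly; a minimal sketch (the model and both losses are just placeholders for your actual setup):

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
x = torch.randn(8, 4)

# gradients from loss1 alone
out = model(x)
loss1 = out.pow(2).mean()
loss1.backward()
grads_ref = [p.grad.clone() for p in model.parameters()]

# gradients from loss1 + 0.0 * loss2
model.zero_grad()
out = model(x)
loss1 = out.pow(2).mean()
loss2 = out.abs().mean()  # placeholder second loss
(loss1 + 0.0 * loss2).backward()

for g_ref, p in zip(grads_ref, model.parameters()):
    print(torch.allclose(g_ref, p.grad))  # prints True for each parameter

One caveat: if loss2 ever produces Inf/NaN gradients, 0.0 * Inf gives NaN, so the multiplication would no longer act as a clean switch.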