I realized I had a bug in my model, which uses multiple optimizers: some of the optimizers had overlapping param groups (I assume that is never supposed to happen).
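A minimal sketch of how such an overlap could be detected, assuming each optimizer exposes `param_groups` as a list of dicts with a `"params"` entry (as `torch.optim` optimizers do). The helper name `find_shared_params` and the stand-in objects are hypothetical and only for illustration; with real optimizers you would pass those in instead.

```python
from collections import defaultdict
from types import SimpleNamespace


def find_shared_params(optimizers):
    """Map id(param) -> [(optimizer name, group index), ...] for every
    parameter object that shows up in more than one place."""
    owners = defaultdict(list)
    for name, opt in optimizers.items():
        for group_idx, group in enumerate(opt.param_groups):
            for p in group["params"]:
                # Compare by identity: two optimizers overlap when they
                # hold the very same parameter object.
                owners[id(p)].append((name, group_idx))
    return {pid: where for pid, where in owners.items() if len(where) > 1}


# Stand-ins for real parameters and optimizers (hypothetical).
w1, w2, w3 = object(), object(), object()
opt_a = SimpleNamespace(param_groups=[{"params": [w1, w2]}])
opt_b = SimpleNamespace(param_groups=[{"params": [w2, w3]}])

shared = find_shared_params({"opt_a": opt_a, "opt_b": opt_b})
print(shared)  # one entry, for w2, listing both optimizers
```

An empty result would mean no parameter is held by more than one optimizer.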
While debugging, I noticed this weird (somewhat unrelated) behaviour:
Here I take one parameter from one optimizer:
In[25]: optim.param_groups[0]["params"][0]
Out[25]:
Parameter containing:
tensor([[ 0.0239, 0.0519, 0.0257, ..., -0.0024, 0.0691, -0.0380]],
       device='cuda:0', requires_grad=True)
I zero out the gradient and step (again, not something you’re supposed to do, but that’s what was happening in my project for a while so I’d like to understand the effect it had):
In[26]: optim.zero_grad()
In[27]: optim.param_groups[0]["params"][0].grad
Out[27]: tensor([[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
In[28]: optim.step()
Even though the gradient was 0, the tensor values changed a bit:
In[29]: optim.param_groups[0]["params"][0]
Out[29]:
Parameter containing:
tensor([[ 0.0238, 0.0518, 0.0258, ..., -0.0024, 0.0691, -0.0380]],
       device='cuda:0', requires_grad=True)
And the gradient is no longer 0; it now holds very small values:
In[30]: optim.param_groups[0]["params"][0].grad
Out[30]:
tensor([[ 2.3881e-08, 5.1892e-08, 2.5746e-08, ..., -2.3582e-09,
6.9058e-08, -3.8007e-08]], device='cuda:0')
Is this normal behaviour? Why did the tensor values change at all?
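One thing I noticed: the new gradient values are roughly 1e-6 times the parameter values (2.3881e-08 vs 0.0239, 5.1892e-08 vs 0.0519), which makes me suspect weight decay, possibly combined with momentum left over from earlier steps. Below is a plain-Python sketch of what I assume an SGD-style step with momentum and L2 weight decay does to a single scalar. The function `sgd_step`, the hyperparameters, and the leftover momentum value are all made up for illustration; I have not confirmed that this is what my optimizer actually does.

```python
def sgd_step(p, grad, buf, lr=0.1, momentum=0.9, weight_decay=1e-6):
    """One SGD-with-momentum update on a scalar (assumed update rule)."""
    # L2 weight decay is folded into the gradient, so the effective
    # gradient is non-zero even when grad == 0.
    d_p = grad + weight_decay * p
    # The momentum buffer carries over from previous steps.
    buf = momentum * buf + d_p
    p = p - lr * buf
    return p, d_p, buf


p = 0.0239   # first parameter value from the session above
buf = 1e-3   # hypothetical momentum left over from earlier steps

new_p, eff_grad, new_buf = sgd_step(p, grad=0.0, buf=buf)
print(new_p, eff_grad)
```

If the decay term were added to `.grad` in place, that would leave `weight_decay * p` in the gradient after the step, and any leftover momentum would move the parameter even with a zero gradient. Is that what is going on here?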