What's the expected behaviour of calling optimizer.step() AFTER optimizer.zero_grad()?

I realized I had a bug in my model with multiple optimizers, because some optimizers had overlapping param groups (I assume that is never supposed to happen).

While debugging, I noticed this weird (somewhat unrelated) behaviour:

Here I take one parameter from one optimizer:

In[25]: optim.param_groups[0]["params"][0]
Out[25]: 
Parameter containing:
tensor([[ 0.0239,  0.0519,  0.0257,  ..., -0.0024,  0.0691, -0.0380]],
       device='cuda:0', requires_grad=True)

I zero out the gradient and then step (again, not something you’re supposed to do, but that’s what was happening in my project for a while, so I’d like to understand the effect it had):

In[26]: optim.zero_grad()
In[27]: optim.param_groups[0]["params"][0].grad
Out[27]: tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')
In[28]: optim.step()

Even though the gradient was zero, the parameter values changed slightly:

In[29]: optim.param_groups[0]["params"][0]
Out[29]: 
Parameter containing:
tensor([[ 0.0238,  0.0518,  0.0258,  ..., -0.0024,  0.0691, -0.0380]],
       device='cuda:0', requires_grad=True)

And the gradient is no longer zero, but now contains very small values:

In[30]: optim.param_groups[0]["params"][0].grad
Out[30]: 
tensor([[ 2.3881e-08,  5.1892e-08,  2.5746e-08,  ..., -2.3582e-09,
          6.9058e-08, -3.8007e-08]], device='cuda:0')

Is this normal behaviour? Why did the parameter values change at all?

Hello Valiox!

One possibility – this would depend on the details of what you’re doing – is that some optimizers have memory, e.g., SGD with momentum or Adam. Even though the current gradient is zero, the optimizer might be remembering previous steps with a non-zero gradient and therefore still be changing the parameters.
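
For illustration, here is a minimal self-contained sketch (the parameter, learning rate, and momentum value are made up for the example, not taken from your model) showing an optimizer with memory still moving a parameter whose current gradient is zero:

import torch
from torch.optim import SGD

# One made-up parameter and SGD with momentum (values are arbitrary).
p = torch.nn.Parameter(torch.tensor([1.0, 2.0, 3.0]))
optim = SGD([p], lr=0.1, momentum=0.9)

# One real backward/step so the momentum buffer gets populated.
(p ** 2).sum().backward()
optim.step()

before = p.detach().clone()
optim.zero_grad(set_to_none=False)  # keep an all-zero .grad tensor in place
optim.step()                        # the current gradient is zero ...

print(torch.equal(before, p.detach()))  # ... but p still moved: prints False
print(p.grad)                           # .grad itself is still all zeros

Here the momentum buffer built up during the first step keeps pushing the parameter, while p.grad itself stays at zero.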

I don’t think that using momentum in an optimizer changes the stored gradient. (Hypothetically, it could be implemented that way, but I don’t think it is.)

However, at the risk of being completely wrong, I seem to recall that weight decay might be implemented by actually modifying the gradient in place. So turning on weight decay in your optimizer could explain both the changing parameters and the non-zero gradient. (!!! Possible misinformation alert !!!)
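
As a sketch of what I mean (again, the numbers here are made up, and whether .grad itself changes depends on the PyTorch version): L2-style weight decay adds weight_decay * param to the gradient before the update, so even an all-zero gradient produces a non-zero update:

import torch
from torch.optim import Adam

# Made-up parameter, Adam with a small weight decay (values are arbitrary).
p = torch.nn.Parameter(torch.tensor([0.5, -0.5]))
optim = Adam([p], lr=1e-3, weight_decay=1e-6)

# One real backward/step so Adam's running averages exist.
(p ** 2).sum().backward()
optim.step()

optim.zero_grad(set_to_none=False)  # leave an all-zero .grad tensor in place
optim.step()                        # weight decay (and Adam's memory) still move p

print(p)       # the parameter has changed despite the zero gradient
print(p.grad)  # whether this shows ~ weight_decay * p depends on the version

If the weight-decay term is folded into .grad in place (as some implementations have done), that would also explain the tiny non-zero gradient values you saw – they look to be on the order of weight_decay times the parameter values.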

I could see your parameter values changing (under the right
conditions) if your optimizer uses momentum. My speculation
about the non-zero gradient is more suspect.

Would it be possible to re-run your test using plain-vanilla
torch.optim.SGD, making sure that momentum and
weight_decay are turned off? (I believe that they are off
by default.)

Best.

K. Frank

Hello KFrank,

You are absolutely correct. My optimizer was Adam with weight decay, and using SGD without weight decay gives the “expected” behaviour:

#optim = Adam(params, lr, weight_decay=1e-6)
optim = SGD(params, lr)
...

In[2]: optim.param_groups[0]["params"][0]
Out[2]: 
Parameter containing:
tensor([[ 0.0913,  0.0789,  0.0728,  ..., -0.0352,  0.0444, -0.0342],
        [ 0.0287,  0.0179,  0.0200,  ..., -0.0781, -0.0859, -0.0117],
        [-0.0233,  0.0586, -0.0877,  ..., -0.0872,  0.0094, -0.0664],
        ...,
        [-0.0286,  0.0854, -0.0020,  ..., -0.0824,  0.0082,  0.0802],
        [-0.0504, -0.0254,  0.0182,  ..., -0.0826,  0.0746,  0.0196],
        [-0.0585, -0.0425, -0.0545,  ...,  0.0924,  0.0455, -0.0788]],
       device='cuda:0', requires_grad=True)
In[3]: optim.zero_grad()
In[4]: optim.param_groups[0]["params"][0].grad
Out[4]: 
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')
In[5]: optim.step()
In[6]: optim.param_groups[0]["params"][0]
Out[6]: 
Parameter containing:
tensor([[ 0.0913,  0.0789,  0.0728,  ..., -0.0352,  0.0444, -0.0342],
        [ 0.0287,  0.0179,  0.0200,  ..., -0.0781, -0.0859, -0.0117],
        [-0.0233,  0.0586, -0.0877,  ..., -0.0872,  0.0094, -0.0664],
        ...,
        [-0.0286,  0.0854, -0.0020,  ..., -0.0824,  0.0082,  0.0802],
        [-0.0504, -0.0254,  0.0182,  ..., -0.0826,  0.0746,  0.0196],
        [-0.0585, -0.0425, -0.0545,  ...,  0.0924,  0.0455, -0.0788]],
       device='cuda:0', requires_grad=True)
In[7]: optim.param_groups[0]["params"][0].grad
Out[7]: 
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')

Thank you!