Hi,
I’m trying to use an optimizer on only a subset of a module’s parameters. Should I call module.zero_grad() instead of optimizer.zero_grad() if there are other layers between the loss and the layer that I’m training?
And what if I need to train only the last layer (right before the loss) of a module? The gradients for the previous layers wouldn’t be computed at all, so there would be no difference between calling module.zero_grad() and calling optimizer.zero_grad(), right?
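For context, something like this is roughly what I have in mind (the layer sizes are just placeholders):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5),   # only this last layer should be trained
)

# the optimizer only sees the last layer's parameters
optimizer = optim.SGD(model[2].parameters(), lr=0.01)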
As @chenyuntc explained, if you pass all parameters of your model to the optimizer, both calls are equivalent.
However, there might be use cases where you would like to use different optimizers for different parts of your model. In such a case, model.zero_grad() would clear the gradients of all of the model’s parameters, while optimizerX.zero_grad() would only clear the gradients of the parameters that were passed to that optimizer.
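Here is a minimal sketch of that second case (the layer sizes, optimizer types, and learning rates are just placeholders):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5),
)

# one optimizer per part of the model
optimizer_a = optim.SGD(model[0].parameters(), lr=0.01)
optimizer_b = optim.Adam(model[2].parameters(), lr=0.001)

model(torch.randn(4, 10)).sum().backward()

optimizer_a.zero_grad()       # clears only the gradients of model[0]
print(model[0].weight.grad)   # None or zeros (depending on set_to_none)
print(model[2].weight.grad)   # still holds the gradients from backward()

model.zero_grad()             # clears the gradients of every parameter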
For example, when we create an optimizer like this:
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)
we register the model’s parameters with this optimizer, so when we call optimizer_ft.zero_grad() it will clear the gradients of those parameters. But if there is another model whose parameters were not passed to this optimizer, optimizer_ft.zero_grad() will not clear their gradients.
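To make this concrete, here is a small sketch (the second model is just for illustration):

import torch
import torch.nn as nn
import torch.optim as optim

model_ft = nn.Linear(10, 2)
other_model = nn.Linear(10, 2)   # its parameters are NOT passed to the optimizer

optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

x = torch.randn(4, 10)
(model_ft(x).sum() + other_model(x).sum()).backward()

optimizer_ft.zero_grad()
print(model_ft.weight.grad)      # cleared (None or zeros)
print(other_model.weight.grad)   # untouched, still holds its gradients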