This is probably the most common question about `optimizer.zero_grad()`; I'm just re-confirming my understanding.

Is it

optimizer.zero_grad()
out = model(batch_X)
loss = loss_fn(out, batch_y)
loss.backward()
optimizer.step()

or

out = model(batch_X)
loss = loss_fn(out, batch_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

I use the second one, as I believe I need to zero out the existing gradients just before the new ones are calculated.

But my project reviewer on Udacity says to use the first one.
The model is an LSTM.

Also, I think both are the same, aren't they?
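
Just to make it concrete, here is the kind of toy check I have in mind (the small linear model, random batch, and loss below are only placeholders, not my actual LSTM setup):

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
batch_X = torch.randn(4, 10)
batch_y = torch.randint(0, 2, (4,))

# Variant 1: zero_grad() before the forward pass
optimizer.zero_grad()
out = model(batch_X)
loss = loss_fn(out, batch_y)
loss.backward()
grads_1 = [p.grad.clone() for p in model.parameters()]

# Variant 2: zero_grad() after the forward pass, right before backward()
out = model(batch_X)
loss = loss_fn(out, batch_y)
optimizer.zero_grad()
loss.backward()
grads_2 = [p.grad.clone() for p in model.parameters()]

# Both orderings leave the same gradients, since gradients are only
# written when backward() is called (no optimizer.step() in between here)
print(all(torch.equal(g1, g2) for g1, g2 in zip(grads_1, grads_2)))  # True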


Hi bro,

I also think they are the same.

I have read a thread about the correct order on these forums (I cannot find it now). In the developers' comments, they recommended using the first way you mentioned above.


Thanks a ton, man. Also, in case you stumble upon that thread, just let me know here :slight_smile:
Thanks :slight_smile:

Sorry for misremembering it.

In this comment, he just recommended calling `optimizer.zero_grad()` before `.backward()`. :disappointed_relieved:

I also agree that both should be the same, and this makes sense, as the gradients are only computed when `backward()` is called…
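
A quick way to see that (just a single linear layer with random data as a stand-in):

import torch
import torch.nn as nn

layer = nn.Linear(3, 1)
x = torch.randn(2, 3)

print(layer.weight.grad)           # None -- no gradients exist before backward()
layer(x).sum().backward()
first = layer.weight.grad.clone()  # gradients only appear after backward()

layer(x).sum().backward()          # without zero_grad(), backward() accumulates
print(torch.allclose(layer.weight.grad, 2 * first))  # True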

I would argue it depends on your “workflow”, since both approaches yield the same result, as others have already said.

I personally prefer the first approach due to my mindset of
“new iteration -> new gradients -> get rid of the old ones”.
Otherwise I’ve sometimes forgotten to zero out the gradients. :wink:
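
In loop form that ordering looks roughly like this (model, loader, loss_fn, and optimizer are assumed to be defined elsewhere):

for batch_X, batch_y in loader:
    optimizer.zero_grad()         # new iteration -> get rid of the old gradients
    out = model(batch_X)
    loss = loss_fn(out, batch_y)
    loss.backward()               # new gradients for this batch
    optimizer.step()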


Thanks @ptrblck for your answer :slight_smile:
