This is probably the most common question about `optimizer.zero_grad()`; I'm just re-confirming my understanding.

Is it

optimizer.zero_grad()
out = model(batch_X)
loss = loss_fn(out, batch_y)
loss.backward()
optimizer.step()

or

out = model(batch_X)
loss = loss_fn(out, batch_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

I use the second one, as I believe I need to zero out the existing gradients just before the new ones are calculated.

But my project reviewer on Udacity says to use the first one.
The model is an LSTM.

Also, I think both are the same, aren't they?
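
Just to make it concrete, here is the kind of toy check I have in mind (the small linear model, random batch, and loss below are only placeholders, not my actual LSTM setup):

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
batch_X = torch.randn(4, 10)
batch_y = torch.randint(0, 2, (4,))

# Variant 1: zero_grad() before the forward pass
optimizer.zero_grad()
out = model(batch_X)
loss = loss_fn(out, batch_y)
loss.backward()
grads_1 = [p.grad.clone() for p in model.parameters()]

# Variant 2: zero_grad() after the forward pass, right before backward()
out = model(batch_X)
loss = loss_fn(out, batch_y)
optimizer.zero_grad()
loss.backward()
grads_2 = [p.grad.clone() for p in model.parameters()]

# Both orderings leave the same gradients, since gradients are only
# written when backward() is called (no optimizer.step() in between here)
print(all(torch.equal(g1, g2) for g1, g2 in zip(grads_1, grads_2)))  # True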


Hi bro,

I also think they are the same.

I have read a thread about the correct order on these forums (I cannot find it now). In the developers' comments, they recommended using the first way you mentioned above.


Thanks a ton, man. Also, in case you stumble upon that thread, just let me know here :slight_smile:
Thanks :slight_smile:

Sorry for misremembering it.

In this comment, he just recommended calling `optimizer.zero_grad()` before `.backward()`. :disappointed_relieved:

I also agree that both should be the same, and this makes sense, as the gradients are only computed when `backward()` is called…
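
A quick way to see that (just a single linear layer with random data as a stand-in):

import torch
import torch.nn as nn

layer = nn.Linear(3, 1)
x = torch.randn(2, 3)

print(layer.weight.grad)           # None -- no gradients exist before backward()
layer(x).sum().backward()
first = layer.weight.grad.clone()  # gradients only appear after backward()

layer(x).sum().backward()          # without zero_grad(), backward() accumulates
print(torch.allclose(layer.weight.grad, 2 * first))  # True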

I would argue it depends on your “workflow”, since both approaches yield the same result, as others have already said.

I personally prefer the first approach due to my mindset of
“new iteration -> new gradients -> get rid of the old ones”.
Otherwise I’ve sometimes forgotten to zero out the gradients. :wink:
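
In loop form that ordering looks roughly like this (model, loader, loss_fn, and optimizer are assumed to be defined elsewhere):

for batch_X, batch_y in loader:
    optimizer.zero_grad()         # new iteration -> get rid of the old gradients
    out = model(batch_X)
    loss = loss_fn(out, batch_y)
    loss.backward()               # new gradients for this batch
    optimizer.step()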


Thanks @ptrblck for your answer :slight_smile:
