Hello Gaurav,
yes, both should work as long as your training loop does not contain another loss whose backward pass runs before the one in your posted training loop, e.g. in case of a more complex architecture that consists of several ANNs and loss functions, which all modify the same network during their backward and step passes.
The only important point is that the gradients are set to zero before you call backward. It does not matter whether you do that at the beginning of an iteration or at the end of the previous one.
But calling zero_grad just before the backward pass is certainly more readable. Therefore, I would stick to using it there.
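To make the two placements concrete, here is a minimal sketch with a toy model, optimizer, and random data (all hypothetical, just for illustration); both variants make sure the gradients are zero before each backward call:

```python
import torch
import torch.nn as nn

# Hypothetical setup for illustration only.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
data = torch.randn(4, 10)
target = torch.randint(0, 2, (4,))

# Variant A: zero the gradients at the beginning of each iteration.
for _ in range(3):
    optimizer.zero_grad()          # gradients are zero before backward
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()

# Variant B: zero the gradients at the end of the previous iteration.
for _ in range(3):
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()          # gradients are cleared for the next iteration
```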
.zero_grad(set_to_none=True) will set the .grad attributes to None instead of filling them with zeros and will thus free the memory of the gradient tensors.
Calling it early might thus be beneficial assuming you don’t need the gradients anymore.
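As a small illustration (again with a hypothetical toy model and optimizer), the .grad attributes are None after calling zero_grad(set_to_none=True):

```python
import torch
import torch.nn as nn

# Hypothetical setup for illustration only.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

out = model(torch.randn(4, 10))
out.mean().backward()
optimizer.step()

# set_to_none=True replaces the .grad tensors with None,
# so their memory can be freed instead of being kept as zero-filled tensors.
optimizer.zero_grad(set_to_none=True)
print(model.weight.grad)  # prints None instead of a zero tensor
```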