Order of backward(), step() and zero_grad()

In most code I see, the order is:

training loop:
    # forward pass and calculate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

If I change it to:

training loop:
    # forward pass and calculate loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

is it still ok?

Yes, it is clear-fill-use vs fill-use-clear

Yes, but you’ll backpropagate n-1 times only, where n is the number of epochs. Why would you not want to follow the “standard” order of the 3?

There are subtle differences in memory management, especially when zero_grad(set_to_none=True) is used.

Why would backpropagation take place only n-1 times in that case?

Hello Gaurav,
yes, both should work as long as no other loss is backpropagated before the training loop you posted, e.g. in a more complex architecture with several ANNs and loss functions that all modify the same network during their backward and step passes.
The only important thing is that your gradients are set to zero before you call backward(). It does not matter whether you do that at the beginning of an iteration or at the end of the previous one.

But calling zero_grad just before the backward pass is certainly more readable. Therefore, I would stick with it.
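
To make this concrete, here is a minimal sketch comparing the two orderings on a toy problem. The `train` helper, the `Linear(4, 1)` model, the random data and the learning rate are all made up for illustration and are not from the original question. The second part shows why the zeroing matters at all: without it, backward() adds to whatever is already in .grad.

```python
import torch

def train(order, steps=3):
    # Hypothetical helper: same seed, model and data for both orderings.
    torch.manual_seed(0)
    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(8, 4)
    y = torch.randn(8, 1)
    for _ in range(steps):
        # forward pass and calculate loss
        loss = torch.nn.functional.mse_loss(model(x), y)
        if order == "clear-fill-use":
            opt.zero_grad()
            loss.backward()
            opt.step()
        else:  # "fill-use-clear"
            loss.backward()
            opt.step()
            opt.zero_grad()
    return [p.detach().clone() for p in model.parameters()]

a = train("clear-fill-use")
b = train("fill-use-clear")
# True if both orderings end up with the same parameters
same = all(torch.allclose(p, q) for p, q in zip(a, b))
print(same)

# Without zeroing, gradients from consecutive backward() calls are summed.
w = torch.ones(2, requires_grad=True)
(w * 3).sum().backward()
(w * 3).sum().backward()  # no zero_grad in between
accumulated = w.grad.clone()
print(accumulated)
```

Note that this only covers the simple case where nothing else has written into .grad before the loop starts. If some earlier backward() call did, fill-use-clear would include those stale gradients in the first step.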

Hi, can you please share more details about the differences?

.zero_grad(set_to_none=True) will delete the .grad attributes and will thus free memory.
Calling it early might thus be beneficial assuming you don’t need the gradients anymore.
