Hello Gaurav,
Yes, both should work as long as your training loop does not contain another loss that is backpropagated before the code you posted, e.g. in a more complex architecture consisting of several networks and loss functions that all modify the same parameters during their backward and step passes.
The only important thing is that the gradients are set to zero before you call backward. It does not matter whether you do this at the beginning of a loop iteration or at the end of the previous one.
But calling zero_grad just before the backward pass is certainly more readable. Therefore, I would stick to using it.
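For illustration, here is a minimal sketch of the two placements; the linear model, SGD optimizer, and random data are just placeholders, and both variants end up with identical parameters:

import copy
import torch

torch.manual_seed(0)
model_a = torch.nn.Linear(4, 2)
model_b = copy.deepcopy(model_a)  # same initial weights
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)
criterion = torch.nn.MSELoss()
data = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(5)]

# Variant 1: zero the gradients at the beginning of each iteration
for x, y in data:
    opt_a.zero_grad()
    loss = criterion(model_a(x), y)
    loss.backward()
    opt_a.step()

# Variant 2: zero the gradients at the end of the iteration, after step()
for x, y in data:
    loss = criterion(model_b(x), y)
    loss.backward()
    opt_b.step()
    opt_b.zero_grad()

print(torch.allclose(model_a.weight, model_b.weight))  # True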
.zero_grad(set_to_none=True) will set the .grad attributes to None and will thus free their memory. Calling it early might therefore be beneficial, assuming you don't need the gradients anymore.
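As a quick illustration of that behaviour (the tiny linear layer is just a stand-in for a real model):

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).sum()
loss.backward()
print(model.weight.grad is None)  # False: gradient tensors are allocated

optimizer.zero_grad(set_to_none=True)
print(model.weight.grad is None)  # True: the gradient memory was freed, not just filled with zeros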
Yes, but you’ll backpropagate n-1 times only, where n is the number of epochs.
I think that’s wrong, you backpropagate n times in both cases.
Yes, and since PyTorch 2.0, set_to_none defaults to True instead of False, so the second order of operations (calling zero_grad() right after step()) frees the gradient memory before the next forward pass and should save memory compared to the first one. I would therefore advise using:
# training loop (model, criterion, optimizer, and loader are assumed to be defined elsewhere)
for data, target in loader:
    # forward pass and calculate loss
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # set_to_none=True by default since PyTorch 2.0
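For completeness, a self-contained toy version of that loop; the model, data, and hyperparameters are made-up placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

for epoch in range(5):
    # forward pass and calculate loss
    output = model(inputs)
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # frees the gradients before the next forward pass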