Hello Gaurav,
yes, both should work as long as your training loop does not contain another loss whose backward pass runs before the one in your posted training loop, e.g. in case of a more complex architecture that consists of several ANNs and loss functions, which all modify the same network during their backward and step passes.
The only important point is that the gradients are set to zero before you call backward. It does not matter whether you do that at the beginning of an iteration or at the end of the previous one.
But calling zero_grad just before the backward pass is certainly more readable. Therefore, I would stick to using it there.
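To make the two placements concrete, here is a minimal sketch with a toy model, optimizer, and random data (all hypothetical, just for illustration); both variants make sure the gradients are zero before each backward call:

```python
import torch
import torch.nn as nn

# Hypothetical setup for illustration only.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
data = torch.randn(4, 10)
target = torch.randint(0, 2, (4,))

# Variant A: zero the gradients at the beginning of each iteration.
for _ in range(3):
    optimizer.zero_grad()          # gradients are zero before backward
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()

# Variant B: zero the gradients at the end of the previous iteration.
for _ in range(3):
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()          # gradients are cleared for the next iteration
```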
.zero_grad(set_to_none=True) will set the .grad attributes to None instead of filling them with zeros and will thus free the memory of the gradient tensors.
Calling it early might thus be beneficial assuming you don’t need the gradients anymore.
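As a small illustration (again with a hypothetical toy model and optimizer), the .grad attributes are None after calling zero_grad(set_to_none=True):

```python
import torch
import torch.nn as nn

# Hypothetical setup for illustration only.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

out = model(torch.randn(4, 10))
out.mean().backward()
optimizer.step()

# set_to_none=True replaces the .grad tensors with None,
# so their memory can be freed instead of being kept as zero-filled tensors.
optimizer.zero_grad(set_to_none=True)
print(model.weight.grad)  # prints None instead of a zero tensor
```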