Can someone explain the difference between step() and backward() and what they do?
Also, can you explain when we should use zero_grad()? Should we use it whenever we want to call backward()?
Hopefully, you use them in the other order: `opt.zero_grad()`, `loss.backward()`, `opt.step()`.

- `opt.zero_grad()` clears old gradients from the last step (otherwise you'd just accumulate the gradients from all `loss.backward()` calls).
- `loss.backward()` computes the derivative of the loss w.r.t. the parameters (or anything requiring gradients) using backpropagation.
- `opt.step()` causes the optimizer to take a step based on the gradients of the parameters.
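Put together, the three calls form the usual training loop. A minimal sketch (the model, data, and learning rate here are made up for illustration):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

x = torch.randn(8, 4)
y = torch.randn(8, 1)

for epoch in range(3):
    opt.zero_grad()                  # 1) clear gradients left over from the last step
    loss = loss_fn(model(x), y)
    loss.backward()                  # 2) write d(loss)/d(param) into each param.grad
    opt.step()                       # 3) update each param using its .grad
```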
Best regards
Thomas
Very gooooooooooooooood, man!
Does zero_grad clear all state information in the optimizer or can the optimizer maintain some information from the previous minibatch?
For example 1: if I change batch size as a function of epoch do I need to recreate the optimizer to zero out its internal state?
For example 2: suppose I have a common model (CNN layers) and two different heads (FC layers), with a small amount of data for head 2 and a lot of data for head 1. I can train head 1, then train head 2. Can I ping-pong between training head 1 and head 2, or will the optimizer get confused because of internal state or a discontinuous loss from the previous minibatch?
Note: the above examples are only to understand the optimizer and if it has any state information.
Both examples should work with a single optimizer*.
Basically, every tensor stores some information about how to calculate its gradient, along with the gradient itself. The gradient is (when initialized) the same shape as the tensor but full of 0s. When you call `backward`, this info is used to calculate the gradients, and the result is added to each tensor's `.grad`. When you call `step`, the optimizer updates the weights based on the gradients. When you call `zero_grad`, the gradients are set to 0.
Most optimizers store extra data (e.g. the momentum of the weights). However, this state has the same shape as the weights, and so is independent of the batch size.
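You can inspect that extra state directly. A small sketch (the parameter shape here is arbitrary):

```python
import torch

# A single parameter; SGD with momentum keeps one extra buffer for it
w = torch.nn.Parameter(torch.randn(3, 2))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)

w.sum().backward()
opt.step()

# The momentum buffer has the same shape as the weight itself,
# so it does not depend on the batch size
buf = opt.state[w]["momentum_buffer"]
print(buf.shape)  # torch.Size([3, 2])

# zero_grad() clears only .grad; the momentum buffer survives
opt.zero_grad(set_to_none=False)
print(w.grad.abs().sum())                  # zero
print("momentum_buffer" in opt.state[w])   # True
```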
*If you have two different models or parts of a model that you train separately, it makes sense to create two different optimizers. As long as you provide the correct parameters to the optimizer, this will work.
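For the shared-body, two-heads setup from example 2 above, that could look like this (a sketch; the module sizes and names are invented, with `Linear` standing in for the CNN body):

```python
import torch
import torch.nn.functional as F

body = torch.nn.Linear(8, 4)   # stands in for the shared CNN layers
head1 = torch.nn.Linear(4, 2)
head2 = torch.nn.Linear(4, 2)

# Each optimizer owns exactly the parameters it should update; state
# (e.g. momentum buffers) is kept per parameter, so alternating is fine
opt1 = torch.optim.SGD(list(body.parameters()) + list(head1.parameters()),
                       lr=0.01, momentum=0.9)
opt2 = torch.optim.SGD(list(body.parameters()) + list(head2.parameters()),
                       lr=0.01, momentum=0.9)

x = torch.randn(16, 8)
y = torch.randint(0, 2, (16,))

def train_step(opt, head):
    opt.zero_grad()
    loss = F.cross_entropy(head(body(x)), y)
    loss.backward()
    opt.step()
    return loss

# Ping-pong between the two heads
l1 = train_step(opt1, head1)
l2 = train_step(opt2, head2)
```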
Thank you, your response is very helpful in my quest to learn PyTorch.
Would you elaborate on what `zero_grad` is doing to "accumulate the gradients from all `loss.backward()` calls"? Do the current gradients apply math (±*/) to the previous gradients? Or are those tensors stacked together, occupying memory, which in most cases we would want to avoid? Thanks
@jkf If we do not call `zero_grad`, then the gradients from all your previous batches will keep adding up for a particular weight, say `w`, and the weight will get updated with the sum of all the previous gradients instead of the gradient of the current batch.
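You can see the summing behaviour directly on a single tensor:

```python
import torch

w = torch.ones(1, requires_grad=True)

# Two backward calls without zeroing in between: gradients are summed
(2 * w).sum().backward()   # d/dw = 2
(3 * w).sum().backward()   # d/dw = 3
accumulated = w.grad.clone()
print(accumulated)         # tensor([5.]) — the sum 2 + 3, not just 3

# zero_grad() does this per parameter: reset .grad to zeros
w.grad.zero_()
print(w.grad)              # tensor([0.])
```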