# When should one be zeroing out gradients?

I wanted to do SGD, but I wasn’t sure I understood when one should zero out the gradients. There are two examples in the tutorials: one zeros the gradients before the backward+update pass and the other after it. Are these two the same? What is the difference? Code (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-nn):

```
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.data)

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model).
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters.
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters.
    optimizer.step()
```
```
for t in range(500):
    # Forward pass: compute predicted y using operations on Variables; these
    # are exactly the same operations we used to compute the forward pass
    # using Tensors, but we do not need to keep references to intermediate
    # values since we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Variables.
    # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape
    # (1,); loss.data is a scalar value holding the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data)

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Variables with requires_grad=True.
    # After this call w1.grad and w2.grad will be Variables holding the
    # gradient of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Update weights using gradient descent; w1.data and w2.data are Tensors,
    # and w1.grad.data and w2.grad.data are Tensors as well.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Manually zero the gradients after updating weights.
    w1.grad.data.zero_()
    w2.grad.data.zero_()
```
Both examples are correct. The first is more explicit, while in the second `w1.grad` is `None` up until the first call to `loss.backward()`, during which it is properly initialized. After that, `w1.grad.data.zero_()` zeroes the gradient for the subsequent iterations.
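That initialization-and-accumulation behavior is easy to check directly. A minimal sketch (using the current tensor API with `requires_grad=True` rather than `Variable`; the parameter `w` is just for illustration):

```python
import torch

w = torch.randn(3, requires_grad=True)  # illustrative parameter
assert w.grad is None        # no backward pass has run yet

loss = (2 * w).sum()
loss.backward()              # first backward initializes w.grad (all 2s here)

loss = (2 * w).sum()
loss.backward()              # without zeroing, gradients accumulate (all 4s)

w.grad.zero_()               # reset before the next iteration
```

So skipping the zeroing step doesn’t crash anything: it silently sums gradients across iterations, which is why both tutorial loops zero them once per step.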
You’re right that `optimizer.step()` needs the gradients to be there, so you don’t want to zero all of them right before the step. However, you can zero the gradients of specific variables that you don’t want the optimizer to update.
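As a sketch of that last point (the names `w_frozen` and `w_train` are hypothetical; note this only skips the update for plain SGD, since momentum or weight decay can still move a parameter whose gradient is zero):

```python
import torch

w_frozen = torch.randn(2, requires_grad=True)  # hypothetical: keep fixed
w_train = torch.randn(2, requires_grad=True)   # hypothetical: update normally
opt = torch.optim.SGD([w_frozen, w_train], lr=0.1)

before = w_frozen.detach().clone()

opt.zero_grad()
loss = (w_frozen + w_train).pow(2).sum()
loss.backward()

# Zero only this variable's gradient before the step; with vanilla SGD
# (no momentum, no weight decay) a zero gradient means no update.
w_frozen.grad.zero_()
opt.step()

assert torch.equal(w_frozen, before)  # w_frozen was left untouched
```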