When should one be zeroing out gradients?

I want to do SGD, but I'm not sure I understand when one should zero out gradients. There are two examples in the tutorials: one zeros the gradients before the backward+update pass, and the other after it. Are these two the same? If not, what is the difference? The first example (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-nn):

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.data[0])

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable weights
    # of the model)
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

vs (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html):

for t in range(500):
    # Forward pass: compute predicted y using operations on Variables; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Variables.
    # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape
    # (1,); loss.data[0] is a scalar value holding the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Variables with requires_grad=True.
    # After this call w1.grad and w2.grad will be Variables holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Update weights using gradient descent; w1.data and w2.data are Tensors,
    # w1.grad and w2.grad are Variables and w1.grad.data and w2.grad.data are
    # Tensors.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Manually zero the gradients after updating weights
    w1.grad.data.zero_()
    w2.grad.data.zero_()

Maybe the only time it's “wrong” is to zero out after the backward pass but before the SGD update?

Both examples are correct. The first is more explicit, while in the second w1.grad is None up to the first call to loss.backward(), during which it is properly initialized. After that, w1.grad.data.zero_() zeroes the gradient for subsequent iterations. Either way, the gradients are zero by the time the next backward pass runs, so the two loops compute the same updates.
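For context, the zeroing is needed at all because .backward() accumulates gradients into .grad instead of overwriting them. A minimal sketch of the accumulation, using current PyTorch syntax and made-up tensors:

    import torch

    w = torch.ones(3, requires_grad=True)
    x = torch.ones(3)

    # First backward pass populates w.grad.
    loss = (w * x).sum()
    loss.backward()
    print(w.grad)  # tensor([1., 1., 1.])

    # A second backward pass *adds* to the existing gradient.
    loss = (w * x).sum()
    loss.backward()
    print(w.grad)  # tensor([2., 2., 2.])

    # Zeroing (before the next backward, or right after the update)
    # resets the buffer so each iteration sees a fresh gradient.
    w.grad.zero_()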

You’re right, optimizer.step() needs the gradients to be there, so you don’t want to zero them between backward() and step(). However, you can zero the gradients of specific variables that you don’t want the optimizer to update, as in the sketch below.
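A sketch of that selective-zeroing trick, assuming a hypothetical tiny model (all names here are made up for illustration). With plain SGD (no momentum or weight decay), a zeroed gradient means a zero update, so the layer is effectively frozen for that step:

    import torch

    # Hypothetical two-layer model.
    model = torch.nn.Sequential(
        torch.nn.Linear(10, 5),
        torch.nn.ReLU(),
        torch.nn.Linear(5, 1),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    x = torch.randn(4, 10)
    y = torch.randn(4, 1)

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Zero the first layer's gradients after backward() but before step():
    # plain SGD then applies no update to that layer.
    model[0].weight.grad.zero_()
    model[0].bias.grad.zero_()

    optimizer.step()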
