I want to do SGD, but I'm not sure I understand when the gradients should be zeroed out. There are two examples in the tutorials. One zeros the gradients before the backward pass and update; the other zeros them after the backward pass and update. Are these two equivalent? What is the difference? Code (from http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-nn):

    for t in range(500):
        # Forward pass: compute predicted y by passing x to the model.
        y_pred = model(x)

        # Compute and print loss.
        loss = loss_fn(y_pred, y)
        print(t, loss.data)

        # Before the backward pass, use the optimizer object to zero all of the
        # gradients for the variables it will update (which are the learnable weights
        # of the model)
        optimizer.zero_grad()

        # Backward pass: compute gradient of the loss with respect to model
        # parameters
        loss.backward()

        # Calling the step function on an Optimizer makes an update to its
        # parameters
        optimizer.step()

And the second example:

    for t in range(500):
        # Forward pass: compute predicted y using operations on Variables; these
        # are exactly the same operations we used to compute the forward pass using
        # Tensors, but we do not need to keep references to intermediate values since
        # we are not implementing the backward pass by hand.
        y_pred = x.mm(w1).clamp(min=0).mm(w2)

        # Compute and print loss using operations on Variables.
        # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape
        # (1,); loss.data is a scalar value holding the loss.
        loss = (y_pred - y).pow(2).sum()
        print(t, loss.data)

        # Use autograd to compute the backward pass. This call will compute the
        # gradient of loss with respect to all Variables with requires_grad=True.
        # After this call w1.grad and w2.grad will be Variables holding the gradient
        # of the loss with respect to w1 and w2 respectively.
        loss.backward()

        # Update weights using gradient descent; w1.data and w2.data are Tensors,
        # w1.grad and w2.grad are Variables and w1.grad.data and w2.grad.data are
        # Tensors.
        w1.data -= learning_rate * w1.grad.data
        w2.data -= learning_rate * w2.grad.data

        # Manually zero the gradients after updating weights
        w1.grad.data.zero_()
        w2.grad.data.zero_()

Maybe the only time it's "wrong" is when you zero the gradients after the backward pass but before the SGD update?
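
A minimal sketch to test that guess. It does not use PyTorch: `ToyParam`, `backward`, and `train` are hypothetical names for a toy scalar model that only imitates the one relevant behaviour, namely that each backward pass adds to the stored gradient instead of overwriting it. It compares four orderings on the loss (w*x - y)^2.

```python
# Toy stand-in for a parameter whose gradient accumulates across backward
# calls. This imitates (and does not use) PyTorch; all names are hypothetical.
class ToyParam:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(w, x, y):
    # d/dw (w*x - y)^2 = 2*(w*x - y)*x, ADDED to whatever grad is already stored.
    w.grad += 2.0 * (w.value * x - y) * x


def train(order, steps=5, lr=0.1, x=1.0, y=2.0):
    w = ToyParam(0.0)
    for _ in range(steps):
        if order == "zero_before_backward":
            w.grad = 0.0
            backward(w, x, y)
            w.value -= lr * w.grad
        elif order == "zero_after_step":
            backward(w, x, y)
            w.value -= lr * w.grad
            w.grad = 0.0
        elif order == "zero_between":
            backward(w, x, y)
            w.grad = 0.0            # gradient is discarded before it is used
            w.value -= lr * w.grad
        elif order == "never_zero":
            backward(w, x, y)
            w.value -= lr * w.grad  # stale gradients keep piling up
        else:
            raise ValueError(f"unknown order: {order}")
    return w.value


a = train("zero_before_backward")
b = train("zero_after_step")
c = train("zero_between")
d = train("never_zero")

print("zero before backward:", a)
print("zero after step:     ", b)
print("zero between:        ", c)
print("never zero:          ", d)
```

In this toy setting the first two orderings give identical weights, because in both cases the gradient is zero when the backward pass starts and still intact when the update reads it. Zeroing between the backward pass and the update leaves the weight at its initial value, and never zeroing gives a different trajectory because old gradients are summed into new ones.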