In the autograd tutorial, under the section “Gradients”, we call out.backward() to use automatic differentiation to compute d(out)/dx. After calling .backward(), the field x.grad is available.

But why do we want to do this? I expected that we would want to compute d(out)/d(parameters). Does this call to .backward() also update the network parameters, and if not, what is the use case?

backward() does not update the parameters. Instead, an optimizer takes the gradient, applies the learning rate, and updates the parameters with the result. Check out SGD for an example.

.grad will be available on the parameters too. If you don’t need the gradient on x, you should set requires_grad=False on it.
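A small sketch of the difference (using plain tensors, which default to requires_grad=False):

```python
import torch

x = torch.randn(3)                      # input: gradient not tracked (default)
w = torch.randn(3, requires_grad=True)  # "parameter": gradient wanted

out = (w * x).sum()
out.backward()

# w.grad is now populated; x.grad stays None because x doesn't require grad.
```

Skipping the gradient on inputs avoids storing an unneeded .grad tensor for them.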

So in the forward pass, the input, a Variable instance, flows through the model. Each layer in the model is really just a single function in a larger, composed function, and this layer creates a new Variable instance as its output. Each of these newly created variables has a grad_fn that tells the variable how it was created. When .backward() is called, each variable can use this cached grad_fn to differentiate itself.
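You can see this chain directly (sketched here with the current tensor API, where Variable has been merged into Tensor):

```python
import torch

x = torch.ones(2, requires_grad=True)  # leaf: created by the user, no grad_fn
y = x * 3                              # created by a multiplication op
z = y.sum()                            # created by a sum op

# Each intermediate result records the function that produced it:
print(y.grad_fn)   # e.g. a MulBackward object
print(z.grad_fn)   # e.g. a SumBackward object

z.backward()       # walks the grad_fn chain back to the leaf x
```

Leaves like x have grad_fn set to None; only results of operations carry one.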

If all that is correct, I have a follow-up question. Where is the reference to the layer parameters stored? Calling .backward() does update the model parameters, not their data but their .grad field. For example:

>>> params = [p for p in model.parameters()]
>>> params[0].grad
# Will be `None`
>>> pred = model(input_)
>>> error = loss(pred, target)
>>> error.backward()
>>> params[0].grad
# Will print a tensor of gradients

How does each Variable know where the parameters are for the layer that created it?