Weight gradient in backward propagation


I know that .backward() calculates the gradients dynamically. I wonder how we can obtain each weight's gradient, layer by layer, while .backward() is running. I hope someone can help.

  1. I know we can monitor the input and output of each layer during .backward() by using .register_backward_hook(). I wonder if layer.register_backward_hook(module, grad_out, grad_in) is the right way to get the weight gradients of each layer. I am asking because, in my own example, I don't see any difference in output between layer.register_backward_hook(module, grad_out, grad_in) and layer.register_backward_hook(module, input, output).

  2. Suppose I do the backward propagation manually, layer by layer (see the code below, adapted from the example in How to split backward process wrt each layer of neural network?). Is it correct to read the weight gradient of each layer from self.layers[i].weight.grad during the backward execution (check the last line of the backward() function)?

import torch
import torch.nn as nn
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
        ])

    def forward(self, x):
        self.output = []
        self.input = []
        for layer in self.layers:
            # detach from previous history
            x = Variable(x.data, requires_grad=True)
            self.input.append(x)

            # compute output
            x = layer(x)

            # add to list of outputs
            self.output.append(x)
        return x

    def backward(self, g):
        for i, output in reversed(list(enumerate(self.output))):
            if i == (len(self.output) - 1):
                # for last node, use g
                output.backward(g)
            else:
                # for earlier nodes, use the gradient of the next layer's input
                output.backward(self.input[i + 1].grad.data)
            # weight gradient of layer i, set once this layer's backward has run
            print(i, self.layers[i].weight.grad.sum())

model = Net()
inp = Variable(torch.randn(4, 10))
output = model(inp)
gradients = torch.randn(*output.size())
model.backward(gradients)


You should not rely on register_backward_hook because, as mentioned in the docs, module backward hooks are mostly broken at the moment.
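
One alternative to module-level hooks, not suggested in this thread but available in PyTorch, is a tensor hook registered directly on each weight with Tensor.register_hook. The sketch below assumes a toy three-layer model (hypothetical, for illustration only) and records each weight gradient at the moment autograd produces it during .backward().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model, chosen only for illustration.
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10), nn.Linear(10, 10))

seen = []  # (layer index, gradient) pairs, in the order autograd produces them

def make_hook(idx):
    def hook(grad):
        # grad is the gradient of the loss w.r.t. this layer's weight
        seen.append((idx, grad.clone()))
    return hook

# Register one tensor hook per weight; keep the handles so they can be removed.
handles = [layer.weight.register_hook(make_hook(i)) for i, layer in enumerate(model)]

loss = model(torch.randn(4, 10)).sum()
loss.backward()

for idx, grad in seen:
    print(idx, tuple(grad.shape))

# Remove the hooks once they are no longer needed.
for h in handles:
    h.remove()
```

The hooks usually fire from the last layer towards the first, since that is the direction in which backward traverses the graph, but the code above does not depend on that order.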

Also, for such approaches, I would recommend using autograd.grad so that you can directly provide the Tensors you want the gradients for.
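
As a minimal sketch of that suggestion, assuming a stack of four Linear layers like the one in the question (the variable names here are illustrative), the weights can be passed to torch.autograd.grad as the inputs argument, and the gradients come back as a tuple in the same order:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stack of layers, mirroring the sizes used in the question.
layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(4)])

x = torch.randn(4, 10)
for layer in layers:
    x = layer(x)
loss = x.sum()

# Ask autograd for exactly the tensors of interest: one weight per layer.
weights = [layer.weight for layer in layers]
weight_grads = torch.autograd.grad(loss, weights)

for i, g in enumerate(weight_grads):
    print(i, tuple(g.shape))

# Nothing was accumulated, so the .grad fields are still unset.
grads_unset = all(w.grad is None for w in weights)
print(grads_unset)
```

Because the gradients are returned rather than accumulated, there is no need to zero .grad between calls.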


Thanks for the response. I am a little confused about torch.autograd.grad and torch.autograd.backward. For a given y = f(x), torch.autograd.grad returns dy/dx. As I understand it, torch.autograd.backward uses the chain rule to calculate the gradient, but I am not clear about what torch.autograd.backward returns. If possible, could you give some hints about the difference between the two?

Thank you in advance!


The difference is that out.backward() will compute the gradient for all the leaf Tensors that were used to compute out and accumulate these gradients in their .grad field.
autograd.grad(out, inp) will compute the gradient of out wrt inp and return that gradient directly.

One way to see it is that backward() is just a nice wrapper around autograd.grad that makes it work nicely with torch.nn:

# This is not the actual implementation
def backward(out, grad_out, *args):
  inp = find_all_leafs(out)
  grads = autograd.grad(out, inp, grad_out, *args)
  for i, g in zip(inp, grads):
    i.grad += g
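
The difference described above can be checked with a small experiment. This is only a sketch on a hypothetical single Linear layer: it compares what autograd.grad returns with what backward() leaves in .grad, and shows that a second backward() call accumulates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(3, 2)  # hypothetical layer, for illustration only
inp = torch.randn(5, 3)

# autograd.grad returns the gradients and leaves .grad untouched.
out = layer(inp).sum()
returned = torch.autograd.grad(out, [layer.weight, layer.bias])
grad_is_unset = layer.weight.grad is None

# backward() returns None and writes the gradients into .grad.
out = layer(inp).sum()
result = out.backward()
first = layer.weight.grad.clone()

# A second backward() adds to the existing .grad instead of replacing it.
out = layer(inp).sum()
out.backward()
second = layer.weight.grad.clone()

print(grad_is_unset, result)
print(torch.allclose(first, returned[0]))
print(torch.allclose(second, 2 * returned[0]))
```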

Many thanks for the good explanation. Problem solved.