# Weight gradient in backward propagation

Hi,

I know that `.backward()` can dynamically calculate the gradient. I wonder how we can obtain each layer's weight gradient during the `.backward()` calculation. I hope someone can help.

1. I know we can monitor the gradients flowing through each layer during `.backward()` by using `.register_backward_hook()`. I wonder whether registering a hook with the signature `hook(module, grad_input, grad_output)` is the right way to get each layer's weight gradient. The reason I am asking is that, in my own example, I see no difference in output whether I name the hook's arguments `(module, grad_input, grad_output)` or `(module, input, output)`.

2. I am able to manually do the backward propagation layer by layer (see the code below, adapted from an example in "How to split backward process wrt each layer of neural network?"). I wonder whether it is correct to read each layer's weight gradient via `self.layers[i].weight.grad` during the backward pass (see the last line of the `backward()` function).

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
        ])

    def forward(self, x):
        self.output = []
        self.input = []
        for layer in self.layers:
            # detach from previous history
            x = Variable(x.data, requires_grad=True)
            self.input.append(x)
            # compute output
            x = layer(x)
            # add to list of outputs
            self.output.append(x)
        return x

    def backward(self, g):
        for i, output in reversed(list(enumerate(self.output))):
            if i == (len(self.output) - 1):
                # for the last layer, use the incoming gradient g
                output.backward(g)
                print(self.input[i].grad.shape)
            else:
                # feed the next layer's input gradient backward
                output.backward(self.input[i + 1].grad.data)
            print(self.layers[i].weight.grad)

model = Net()
inp = Variable(torch.randn(4, 10))
output = model(inp)
gradients = torch.randn(*output.size())
model.backward(gradients)
```
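For reference, here is the sanity check I would use (a sketch, assuming a recent PyTorch where `detach()` replaces the `Variable` wrapper): the per-layer weight gradients collected by the manual layer-by-layer backward should match those produced by a single end-to-end `.backward()` call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(4)])
inp = torch.randn(4, 10)
g = torch.randn(4, 10)

# Reference: one end-to-end backward through the whole stack.
x = inp
for layer in layers:
    x = layer(x)
x.backward(g)
reference = [layer.weight.grad.clone() for layer in layers]

# Reset, then run the layer-by-layer scheme: detach between layers and
# feed each layer's input gradient into the previous layer's output.
for layer in layers:
    layer.zero_grad()
inputs, outputs = [], []
x = inp
for layer in layers:
    x = x.detach().requires_grad_(True)
    inputs.append(x)
    x = layer(x)
    outputs.append(x)
grad = g
for i in reversed(range(len(outputs))):
    outputs[i].backward(grad)
    grad = inputs[i].grad
    # layers[i].weight.grad is now populated, as in the question.

for ref, layer in zip(reference, layers):
    assert torch.allclose(ref, layer.weight.grad)
```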

Hi,

You should not rely on `register_backward_hook` because, as mentioned in the docs, these hooks are mostly broken at the moment.

Also, for such approaches, I would recommend using `autograd.grad` so that you can directly provide the Tensors you want the gradients for.
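A minimal sketch of that approach (the model and shapes here are illustrative, not from the question): pass the weight Tensors themselves to `autograd.grad` and get their gradients back directly, without going through the `.grad` fields.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10))
x = torch.randn(4, 10)
out = model(x).sum()

# Ask autograd directly for d(out)/d(weight) for every layer.
weights = [layer.weight for layer in model]
grads = torch.autograd.grad(out, weights)

for w, g in zip(weights, grads):
    print(g.shape)  # same shape as the corresponding weight
```

Note that `autograd.grad` returns the gradients as a tuple and does not populate `weight.grad`.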

Hi,

Thanks for the response. I am a little confused about `torch.autograd.grad` and `torch.autograd.backward`. So for a given y = f(x), `torch.autograd.grad` returns dy/dx. Based on my understanding, `torch.autograd.backward` uses the chain rule to calculate the gradient. However, I am not very clear about what `torch.autograd.backward` returns. If possible, could you provide some hints about the difference between these two?

Thank you in advance!

Hi,

The difference is that `out.backward()` will compute the gradient for all the leaf Tensors that were used to compute `out` and accumulate these gradients into their `.grad` field.
`autograd.grad(out, inp)` will compute the gradient of `out` with respect to `inp` and return that gradient directly.

One way to see it is that `backward()` is just a nice wrapper around `autograd.grad` that plays well with `torch.nn`:

```python
# This is not the actual implementation
def backward(out, grad_out, *args):
    inp = find_all_leafs(out)
    grads = autograd.grad(out, inp, grad_out, *args)
    for i, g in zip(inp, grads):
        i.grad += g
```

Many thanks for the good explanation. Problem solved.