Hi,
If `y` is of size N, then `y.backward(gradient)` will use the chain rule to compute the gradient for every parameter in the network. For a given parameter `w` of size d, it will compute `gradient * dy/dw`, where `dy/dw` is obtained by the chain rule.
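For example, here is a minimal sketch of the non-scalar case; the small linear map and the shapes are made up for illustration:

```python
import torch

# Minimal sketch: a linear map y = x @ w with a non-scalar output y.
# The names x, w and the shapes are just for illustration.
w = torch.randn(3, 4, requires_grad=True)
x = torch.randn(2, 3)
y = x @ w                      # y has size (2, 4), so it is not a scalar

gradient = torch.ones_like(y)  # the "gradient" argument, same shape as y
y.backward(gradient)           # computes gradient * dy/dw via the chain rule

print(w.grad.shape)            # same shape as w: (3, 4)
```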
If `loss` is a tensor with a single element, `loss.backward()` is the same as `loss.backward(torch.Tensor([1]))` and thus will compute, for every parameter `w`: `1 * dloss/dw = dloss/dw`. So the `.grad` attribute of each `w` will just contain this gradient.
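As a small check of the scalar case (the toy "model" with a single weight vector `w` is made up; the explicit gradient is passed as a 0-dim tensor to match the 0-dim loss):

```python
import torch

x = torch.randn(3)

w1 = torch.randn(3, requires_grad=True)
loss1 = (w1 * x).sum()            # single-element loss
loss1.backward()                  # implicit gradient of 1

w2 = w1.detach().clone().requires_grad_(True)
loss2 = (w2 * x).sum()
loss2.backward(torch.tensor(1.))  # explicit gradient of 1, same result

print(torch.allclose(w1.grad, w2.grad))  # True: both contain dloss/dw = x
```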
I am not sure where you saw people doing `y.backward(gradient=target)`, but that basically corresponds to having a loss function `loss = sum(y * target)`: in that case `dloss/dw = dloss/dy * dy/dw = target * dy/dw`, since from the loss definition above `dloss/dy = target`. I don't know where such a loss function is used, though.
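Here is a quick sketch checking that equivalence; the linear model, `x` and `target` are made up for illustration:

```python
import torch

x = torch.randn(2, 3)
target = torch.randn(2, 4)

w1 = torch.randn(3, 4, requires_grad=True)
y1 = x @ w1
y1.backward(gradient=target)           # y.backward(gradient=target)

w2 = w1.detach().clone().requires_grad_(True)
y2 = x @ w2
loss = (y2 * target).sum()             # loss = sum(y * target)
loss.backward()

print(torch.allclose(w1.grad, w2.grad))  # True: the two give the same gradient
```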
Warning: `register_backward_hook` is somewhat broken at the moment for `nn.Module`s and you should avoid relying on it. But it should give you exactly the same thing as `input.grad`.
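If you just want the gradient with respect to the input, you can read `input.grad` directly; a minimal sketch (the small `nn.Linear` model is made up):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
inp = torch.randn(2, 3, requires_grad=True)  # make the input a leaf that tracks grads

out = model(inp).sum()
out.backward()

print(inp.grad)   # d(out)/d(inp), the quantity a (working) backward hook would be expected to report
```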
Also note that the `.grad` field is populated only for leaf tensors (that you created with `requires_grad=True`) or tensors on which you explicitly called `.retain_grad()`.
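A small sketch of the leaf vs. non-leaf distinction (the tensors here are made up):

```python
import torch

w = torch.randn(3, requires_grad=True)  # leaf created with requires_grad=True
y = w * 2                               # non-leaf (result of an op)
y.retain_grad()                         # without this, y.grad would stay None

loss = y.sum()
loss.backward()

print(w.grad)   # populated: w is a leaf with requires_grad=True
print(y.grad)   # populated only because of retain_grad(); dloss/dy is all ones
```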