Hi,
If `y` is of size N, then `y.backward(gradient)` will use the chain rule to compute the gradient for every parameter in the network. For a given parameter `w` of size d, it will compute `gradient * dy/dw`, where `dy/dw` is obtained by the chain rule.
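For example, here is a minimal sketch of the non-scalar case; the small linear map and the shapes are made up for illustration:

```python
import torch

# Minimal sketch: a linear map y = x @ w with a non-scalar output y.
# The names x, w and the shapes are just for illustration.
w = torch.randn(3, 4, requires_grad=True)
x = torch.randn(2, 3)
y = x @ w                      # y has size (2, 4), so it is not a scalar

gradient = torch.ones_like(y)  # the "gradient" argument, same shape as y
y.backward(gradient)           # computes gradient * dy/dw via the chain rule

print(w.grad.shape)            # same shape as w: (3, 4)
```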
If `loss` is a tensor with a single element, `loss.backward()` is the same as `loss.backward(torch.Tensor([1]))` and thus will compute, for every parameter `w`: `1 * dloss/dw = dloss/dw`. So the `.grad` attribute of each `w` will just contain this gradient.
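As a small check of the scalar case (the toy "model" with a single weight vector `w` is made up; the explicit gradient is passed as a 0-dim tensor to match the 0-dim loss):

```python
import torch

x = torch.randn(3)

w1 = torch.randn(3, requires_grad=True)
loss1 = (w1 * x).sum()            # single-element loss
loss1.backward()                  # implicit gradient of 1

w2 = w1.detach().clone().requires_grad_(True)
loss2 = (w2 * x).sum()
loss2.backward(torch.tensor(1.))  # explicit gradient of 1, same result

print(torch.allclose(w1.grad, w2.grad))  # True: both contain dloss/dw = x
```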
I am not sure where you saw people doing `y.backward(gradient=target)`, but that basically corresponds to having a loss function `loss = sum(y * target)`: in that case `dloss/dw = dloss/dy * dy/dw = target * dy/dw`, since from the loss definition above `dloss/dy = target`. I don't know where such a loss function is used, though.
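Here is a quick sketch checking that equivalence; the linear model, `x` and `target` are made up for illustration:

```python
import torch

x = torch.randn(2, 3)
target = torch.randn(2, 4)

w1 = torch.randn(3, 4, requires_grad=True)
y1 = x @ w1
y1.backward(gradient=target)           # y.backward(gradient=target)

w2 = w1.detach().clone().requires_grad_(True)
y2 = x @ w2
loss = (y2 * target).sum()             # loss = sum(y * target)
loss.backward()

print(torch.allclose(w1.grad, w2.grad))  # True: the two give the same gradient
```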
Warning: `register_backward_hook` is somewhat broken at the moment for `nn.Module`s and you should avoid relying on it. But it should give you exactly the same thing as `input.grad`.
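If you just want the gradient with respect to the input, you can read `input.grad` directly; a minimal sketch (the small `nn.Linear` model is made up):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
inp = torch.randn(2, 3, requires_grad=True)  # make the input a leaf that tracks grads

out = model(inp).sum()
out.backward()

print(inp.grad)   # d(out)/d(inp), the quantity a (working) backward hook would be expected to report
```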
Also note that the `.grad` field is populated only for leaf tensors (that you created with `requires_grad=True`) or tensors on which you explicitly called `.retain_grad()`.
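A small sketch of the leaf vs. non-leaf distinction (the tensors here are made up):

```python
import torch

w = torch.randn(3, requires_grad=True)  # leaf created with requires_grad=True
y = w * 2                               # non-leaf (result of an op)
y.retain_grad()                         # without this, y.grad would stay None

loss = y.sum()
loss.backward()

print(w.grad)   # populated: w is a leaf with requires_grad=True
print(y.grad)   # populated only because of retain_grad(); dloss/dy is all ones
```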