Why do we need to pass the gradient parameter to the backward function in PyTorch?

According to the docs, when we call the backward function on a tensor, if the tensor is non-scalar (i.e. its data has more than one element) and requires gradient, the function additionally requires specifying a gradient argument.

import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.tensor([1.,1.])) 

print(a.grad)

Output: tensor([20., 20.])

Now scaling the external gradient:

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.tensor([2.,2.])) #modified

print(a.grad)

Output: tensor([40., 40.])

So, passing the gradient argument to backward seems to scale the gradients.
Also, for a scalar output, F.backward() is by default equivalent to F.backward(gradient=torch.tensor(1.)).
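
As a quick sanity check of that last point, here is a small sketch reusing the same a and b from above, with the output summed to a scalar so that no gradient argument is needed:

import torch

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

S = (a * b).sum()    # scalar output
S.backward()         # equivalent to S.backward(gradient=torch.tensor(1.))
print(a.grad)

Output: tensor([20., 20.])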

Apart from scaling the grad values, how does the gradient parameter passed to the backward function help to compute the derivatives when we have a non-scalar tensor?
Why can’t PyTorch calculate the derivative implicitly, without asking for an explicit gradient parameter, as it does for a scalar tensor?

Also, in the scalar case .backward() without parameters is equivalent to .backward(torch.tensor(1.0)), so why can’t we broadcast torch.tensor(1.0) so that it also works for non-scalar outputs, instead of having to pass an explicit external gradient while computing the vector-Jacobian product?

I’m pretty sure that when you do something like y.backward(torch.ones_like(y)), you’re just telling autograd to repeat .backward() for each element in y under the hood.
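
For what it’s worth, here is a small sketch consistent with that intuition, reusing the tensors from the question (the a.grad = None reset is only there to avoid gradient accumulation between the two calls): passing torch.ones_like(F) produces the same gradients as summing F to a scalar and calling .backward() with no argument.

import torch

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.ones_like(F))  # vector-Jacobian product with v = [1., 1.]
print(a.grad)                            # tensor([20., 20.])

a.grad = None                            # clear the accumulated gradient
F = a * b                                # rebuild the graph
F.sum().backward()                       # scalar output, no gradient argument needed
print(a.grad)                            # tensor([20., 20.]) -- same result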


@AlphaBetaGamma96 can you share some resources that describe how:

y.backward(torch.ones_like(y)) is telling autograd to repeat .backward() for each element in y under the hood.

Hi,

If you consider a function f with n_input inputs and n_output outputs, its Jacobian matrix J_f, which contains all its partial derivatives, has size (n_output x n_input).
What backpropagation (or AD, whichever way you want to name it) does is compute v^T J_f for a given v.
If your function has a single output, it makes sense to take v = 1 so that backprop will return J_f.
But if you have multiple outputs, there is no good default, so we require the user to provide the v value they want.

If you want to reconstruct the full J_f, you will have to do as many backward passes as there are outputs in your function. You can use the autograd.functional.jacobian function if you need that.
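
To make that concrete, here is a minimal sketch reusing the tensors from the original question, with an arbitrary v = [2., 2.]: the gradient argument is the v in v^T J_f, and multiplying v by the full Jacobian recovered with torch.autograd.functional.jacobian reproduces what backward stores in a.grad.

import torch

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
v = torch.tensor([2.,2.])

# backward computes the vector-Jacobian product v^T J_f
F = a * b
F.backward(gradient=v)
print(a.grad)   # tensor([40., 40.])

# Reconstruct the full Jacobian of F with respect to a and compare
J = torch.autograd.functional.jacobian(lambda x: x * b, a)
print(J)        # diag(20., 20.) -- dF_i/da_j = b_i if i == j else 0
print(v @ J)    # tensor([40., 40.]) -- matches a.grad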


@albanD thanks, it’s clear now :D.