Why do we need to pass the gradient parameter to the backward function in PyTorch?

According to the docs, when we call the backward function on a tensor, if the tensor is non-scalar (i.e. its data has more than one element) and requires gradient, the function additionally requires specifying a gradient argument.

import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.tensor([1.,1.])) 

print(a.grad)

Output: tensor([20., 20.])

Now scaling the external gradient:

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.tensor([2.,2.])) #modified

print(a.grad)

Output: tensor([40., 40.])

So, passing the gradient argument to backward seems to scale the gradients.
Also, for a scalar output, F.backward() is by default equivalent to F.backward(gradient=torch.tensor(1.)).
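
As a quick sanity check of that last point, here is a small sketch reusing the same a and b from above, with the output summed to a scalar so that no gradient argument is needed:

import torch

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

S = (a * b).sum()    # scalar output
S.backward()         # equivalent to S.backward(gradient=torch.tensor(1.))
print(a.grad)

Output: tensor([20., 20.])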

Apart from scaling the grad values, how does the gradient parameter passed to the backward function help to compute the derivatives when we have a non-scalar tensor?
Why can’t PyTorch calculate the derivative implicitly, without asking for an explicit gradient parameter, as it does for a scalar tensor?

Also, in the scalar case .backward() without parameters is equivalent to .backward(torch.tensor(1.0)), so why can’t we broadcast torch.tensor(1.0) so that it also works for non-scalar outputs, instead of having to pass an explicit external gradient while computing the vector-Jacobian product?

I’m pretty sure that when you do something like y.backward(torch.ones_like(y)), you’re just telling autograd to repeat .backward() for each element in y under the hood.
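
For what it’s worth, here is a small sketch consistent with that intuition, reusing the tensors from the question (the a.grad = None reset is only there to avoid gradient accumulation between the two calls): passing torch.ones_like(F) produces the same gradients as summing F to a scalar and calling .backward() with no argument.

import torch

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.ones_like(F))  # vector-Jacobian product with v = [1., 1.]
print(a.grad)                            # tensor([20., 20.])

a.grad = None                            # clear the accumulated gradient
F = a * b                                # rebuild the graph
F.sum().backward()                       # scalar output, no gradient argument needed
print(a.grad)                            # tensor([20., 20.]) -- same result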


@AlphaBetaGamma96 can you share some resources that describe how:

y.backward(torch.ones_like(y)) is telling autograd to repeat .backward() for each element in y under the hood.

Hi,

If you consider a function f with n_input inputs and n_output outputs, its Jacobian matrix J_f, which contains all its partial derivatives, has size (n_output x n_input).
What backpropagation (or AD, whichever way you want to name it) does is compute v^T J_f for a given v.
If your function has a single output, it makes sense to take v = 1 so that backprop will return J_f.
But if you have multiple outputs, there is no good default, so we require the user to provide the v value they want.

If you want to reconstruct the full J_f, you will have to do as many backward passes as there are outputs in your function. You can use the autograd.functional.jacobian function if you need that.
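
To make that concrete, here is a minimal sketch reusing the tensors from the original question, with an arbitrary v = [2., 2.]: the gradient argument is the v in v^T J_f, and multiplying v by the full Jacobian recovered with torch.autograd.functional.jacobian reproduces what backward stores in a.grad.

import torch

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
v = torch.tensor([2.,2.])

# backward computes the vector-Jacobian product v^T J_f
F = a * b
F.backward(gradient=v)
print(a.grad)   # tensor([40., 40.])

# Reconstruct the full Jacobian of F with respect to a and compare
J = torch.autograd.functional.jacobian(lambda x: x * b, a)
print(J)        # diag(20., 20.) -- dF_i/da_j = b_i if i == j else 0
print(v @ J)    # tensor([40., 40.]) -- matches a.grad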


@albanD thanks, it’s clear now :D.