According to the docs, when we call the backward function on a tensor, if that tensor is non-scalar (i.e. its data has more than one element) and requires grad, the function additionally requires us to specify a gradient argument.
import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
F.backward(gradient=torch.tensor([1.,1.]))
print(a.grad)
Output: tensor([20., 20.])
Now scaling the external gradient:
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
F.backward(gradient=torch.tensor([2.,2.])) #modified
print(a.grad)
Output: tensor([40., 40.])
So, passing the gradient argument to backward seems to scale the gradients.
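If I understand correctly, F.backward(gradient=v) computes the vector Jacobian product, which for this example should give the same a.grad as calling backward on the scalar (F * v).sum(). A minimal sketch of what I mean (same values as the second example above):
import torch
a = torch.tensor([10., 10.], requires_grad=True)
b = torch.tensor([20., 20.], requires_grad=True)
v = torch.tensor([2., 2.])
F = a * b
(F * v).sum().backward()  # scalar loss, so no gradient argument needed
print(a.grad)
Output: tensor([40., 40.])  # matches F.backward(gradient=v)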
Also, for a scalar output, F.backward() is by default equivalent to F.backward(gradient=torch.tensor(1.0)).
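A quick check of that default on a scalar output (a minimal sketch; the tensor names are just for illustration):
import torch
x = torch.tensor(3., requires_grad=True)
y = x * x                               # scalar output
y.backward()                            # no gradient argument needed
print(x.grad)                           # tensor(6.)
x.grad = None
y = x * x
y.backward(gradient=torch.tensor(1.0))  # passing the default explicitly
print(x.grad)                           # tensor(6.) -- same result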
Apart from scaling the grad values, how does the gradient argument passed to backward help compute the derivatives when we have a non-scalar tensor?
Why can't PyTorch compute the derivative implicitly, without requiring an explicit gradient argument, the way it does for a scalar tensor?
Also, in the case of a scalar output, .backward() without arguments is equivalent to .backward(torch.tensor(1.0)); so why can't PyTorch just broadcast torch.tensor(1.0) so that it also works for non-scalar outputs, instead of requiring an explicit external gradient argument when computing the vector Jacobian product?
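For reference, this is what I see when I omit the gradient argument for a non-scalar output (a minimal reproduction; the exact error message may differ between versions):
import torch
a = torch.tensor([10., 10.], requires_grad=True)
b = torch.tensor([20., 20.], requires_grad=True)
F = a * b
try:
    F.backward()  # non-scalar output, no gradient argument
except RuntimeError as e:
    print(e)      # "grad can be implicitly created only for scalar outputs"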