I am struggling to understand the behaviour of non-scalar Tensors when backward is called on them.

import torch

a = torch.Tensor([1,2,3])
a.requires_grad = True
b = 2*a

b.backward(gradient=torch.Tensor([1, 1, 1]))

a.grad
Out[100]: tensor([ 2., 2., 2.])

What is the value in a.grad?
Is it <db1/da1, db2/da2, db3/da3>? If so, what is the purpose of the gradient argument?
If you pass a gradient argument of size 1, a.grad holds a single scalar:

a = torch.Tensor([1,2,3])
a.requires_grad = True
b = 2*a

b.backward(gradient=torch.Tensor([1]))

a.grad
tensor([ 2.])

What does this scalar represent, given that the gradient should be a vector?

It would be great if someone could shed more light on this.

Let gO = torch.Tensor([1,1,1]) be the tensor you give to the backward() function. Then a.grad contains:
<gO[0] * db1/da1 + gO[1] * db2/da1 + gO[2] * db3/da1, gO[0] * db1/da2 + gO[1] * db2/da2 + gO[2] * db3/da2, gO[0] * db1/da3 + gO[1] * db2/da3 + gO[2] * db3/da3>.
In your particular case, b is the element-wise product of a and 2, so dbi/daj is 0 if i != j and 2 if i == j. So in your case it boils down to <1*2 + 1*0 + 1*0, 1*0 + 1*2 + 1*0, 1*0 + 1*0 + 1*2> = <2, 2, 2>.
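To make the formula above concrete, here is a minimal sketch that uses a non-uniform gO so that each weight is visible in the result. The values in gO and the hand-written Jacobian J are chosen purely for illustration; they are not from the original question.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = 2 * a

# Illustrative weights: one per element of b.
gO = torch.tensor([1.0, 10.0, 100.0])
b.backward(gradient=gO)

# Jacobian of b = 2 * a written out by hand: J[i][j] = db_i / da_j.
J = 2 * torch.eye(3)

# a.grad[j] should equal the sum over i of gO[i] * db_i / da_j.
expected = gO @ J

print(a.grad)
print(expected)
```

Both printed tensors hold the values 2, 20 and 200, so a.grad is gO multiplied into the Jacobian rather than the Jacobian itself.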

Thanks albanD! That makes sense when gO has the same length as b. Do you know what happens when gO has length 1, and why a.grad then has length 1?

It is a bit of a shame that this isn’t written in the official documentation. A formula is worth a thousand words!

gO has to have the same length as b. If b has a single value, b.backward() is a convenient way to write b.backward(torch.Tensor([1])).
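As a rough illustration of why a gO of all ones is the natural default, the sketch below compares reducing b to a single value with sum() and calling backward() with no argument against passing a gO of ones with the same shape as b. The variable names a1, a2 and b2 are made up for this example.

```python
import torch

# Case 1: reduce to a single value, then call backward() with no argument.
a1 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
(2 * a1).sum().backward()

# Case 2: keep b as a vector and pass a gO of ones with b's shape.
a2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b2 = 2 * a2
b2.backward(gradient=torch.ones_like(b2))

# Both routes should leave the same gradient on the input.
same = torch.equal(a1.grad, a2.grad)
print(a1.grad, a2.grad, same)
```

In both cases the gradient holds the value 2 for every element of the input.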

The fact that you can give a gradient with a different size than the output is a bug and will give undefined behaviour. @apaszke, checks should be added to make sure that the sizes match here, right?
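Until such a check exists in the version you are running, one possible workaround is to guard the call yourself. The helper backward_checked below is hypothetical (it is not part of PyTorch) and only sketches the idea of comparing shapes before calling backward(). Newer PyTorch releases may already reject such a call on their own; the sketch does not rely on that.

```python
import torch

def backward_checked(output, grad_output):
    """Hypothetical helper: refuse to run backward on mismatched shapes."""
    if grad_output.shape != output.shape:
        raise ValueError(
            f"gradient shape {tuple(grad_output.shape)} does not match "
            f"output shape {tuple(output.shape)}"
        )
    output.backward(gradient=grad_output)

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = 2 * a

try:
    backward_checked(b, torch.tensor([1.0]))  # length 1 vs length 3
    rejected = False
except ValueError as err:
    rejected = True
    print(err)

backward_checked(b, torch.ones_like(b))  # shapes match, so backward runs
print(a.grad)
```

The mismatched call is rejected before autograd runs, so the graph is still intact for the second, correctly shaped call.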