Behaviour of autograd with broadcasted tensors

Hello, while trying to apply autograd to perform reverse-mode autodiff on multi-dimensional input arrays, I found the following behaviour:

import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
b = torch.tensor([10., 20., 30.], requires_grad=True)
q = 3*a**3 - b**2
q.backward(torch.ones_like(a))
print(torch.all(a.grad == 9*a**2).item(), torch.all(b.grad == -2*b).item())
>>> True False

Indeed, the value of b.grad is -4b, as if the gradients were accumulated twice. However, this is not the case: b.register_post_accumulate_grad_hook(lambda p: print(f"b accumulating grad of {p}: {p.grad=}")) shows that the gradient is only accumulated once.
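For completeness, here is a small self-contained check (my own sketch, not part of the original run) that the -4b value is consistent with a single accumulation of the per-entry gradient -2b summed over the broadcast dimension:

import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
b = torch.tensor([10., 20., 30.], requires_grad=True)

# Fires once, after the gradient for b has been fully accumulated into .grad.
b.register_post_accumulate_grad_hook(lambda p: print(f"b accumulating grad: {p.grad=}"))

q = 3*a**3 - b**2
q.backward(torch.ones_like(a))

# b contributes to both rows of q, so its gradient is the per-entry
# gradient -2*b summed over the broadcast (row) dimension: 2 * (-2*b) = -4*b.
expected = (-2 * b.detach()).expand(2, 3).sum(dim=0)
print(torch.allclose(b.grad, expected))          # True
print(torch.allclose(b.grad, -4 * b.detach()))   # True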

On the other hand, the results are different if b is already broadcasted to the right shape:

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
b = torch.tensor([[10., 20., 30.], [10., 20., 30.]], requires_grad=True)
q = 3*a**3 - b**2
q.backward(torch.ones_like(a))
print(torch.all(a.grad == 9*a**2).item(), torch.all(b.grad == -2*b).item())
>>> True True

So what exactly happens here? I guess the different behaviour is due to the way the Jacobian-vector product is implemented in autograd. Could you provide some details about the implementation for arrays with more than one dimension (like matrices and multi-dimensional tensors)? I could not find this in the online documentation.

Thank you very much.

P.S. I ran the code with PyTorch version 2.5.1+cu124.

You generally want to use torch.allclose instead of ==, since there is no guarantee of bit-for-bit equivalent results.

The post-accumulate-grad hook is called when .grad is set, but if a tensor is used multiple times, the accumulation happens before that, inside the graph.
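A minimal sketch (my own example, not from the thread) of a tensor used twice: the two contributions are summed inside the graph, and the post-accumulate hook still fires only once, seeing the already-summed value:

import torch

x = torch.tensor([1., 2.], requires_grad=True)
x.register_post_accumulate_grad_hook(lambda p: print(f"hook fired once, {p.grad=}"))

# x is used twice; the two contributions (2 and 3) are summed
# inside the graph before .grad is ever set.
y = 2*x + 3*x
y.backward(torch.ones_like(y))

print(x.grad)  # tensor([5., 5.]) -- accumulated into .grad exactly once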

Hello, thank you for the reply.

However, the issue here is not floating-point precision. I think I now understand the mechanism: broadcasting adds another operation to the graph, so the Jacobian of the whole transformation is multiplied by the Jacobian of the broadcasting transformation (recorded as ExpandBackward). The transpose of this Jacobian matrix multiplied by the vector of ones then gives exactly what b.grad shows.

Of course, if one starts with b already in the shape of a, then broadcasting does not happen, there is one less Jacobian in the chain, and the gradients are different.
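If this reading is right, making the broadcast explicit with .expand_as() should reproduce the same gradient, since the backward of the expand sums over the expanded dimension (a small sketch of my own, reusing the same numbers as above):

import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
b = torch.tensor([10., 20., 30.], requires_grad=True)

# Make the implicit broadcast explicit: the backward of the expand
# sums the incoming gradient over the expanded (row) dimension.
q = 3*a**3 - b.expand_as(a)**2
q.backward(torch.ones_like(a))

print(torch.allclose(b.grad, -4 * b.detach()))  # True, same as with implicit broadcasting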

If my intuition is correct, the question can be marked as resolved, thank you :slight_smile:.

Sorry, you are totally right: because you implicitly broadcast during the forward pass, autograd implicitly reduces during the backward pass, and you end up with .grad values that are double what they would be if you had pre-broadcast.
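To make that concrete (a sketch of my own; the names b_small and b_big are made up for illustration), the gradient of the 1-D b is exactly the gradient of the pre-broadcast b reduced over the broadcast dimension, i.e. double its per-element value here:

import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
b_small = torch.tensor([10., 20., 30.], requires_grad=True)                    # broadcast implicitly
b_big = torch.tensor([[10., 20., 30.], [10., 20., 30.]], requires_grad=True)   # pre-broadcast

(3*a**3 - b_small**2).backward(torch.ones_like(a))
(3*a.detach()**3 - b_big**2).backward(torch.ones_like(a))

# The backward of the implicit broadcast sums over the broadcast dimension,
# so b_small.grad equals b_big.grad reduced over dim 0 (double per element here).
print(torch.allclose(b_small.grad, b_big.grad.sum(dim=0)))  # True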