What is the grad_outputs kwarg in autograd.grad?

I am having trouble understanding exactly what this line means in the docs

grad_outputs (sequence of Tensor) – The “vector” in the Jacobian-vector product. Usually gradients w.r.t. each output. None values can be specified for scalar Tensors or ones that don’t require grad. If a None value would be acceptable for all grad_tensors, then this argument is optional. Default: None.

I see this thread which partially explains it (None is equivalent to passing in torch.ones(...) of the proper size), but I still don't really understand what it is for or when it should be used.

Any input? Thanks

Hi,

None is equivalent to passing in torch.ones(...) of the proper size

This is only true for an output with a single element!

Otherwise, you can see these outputs as providing dL/dout (where L is your loss) so that autograd can compute dL/dw (where w are the parameters for which you want the gradients) as dL/dw = dL/dout * dout/dw.

Another way to see this, as mentioned in the doc, is that autograd only computes a vector-matrix product between a vector v and the Jacobian of the function. grad_outputs allows you to specify this vector v.
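To make this concrete, here is a minimal runnable sketch (the function and tensors are illustrative, not from the thread): for a function from R^3 to R^2 the Jacobian J is 2x3, and grad_outputs supplies the vector v so that grad returns v^T J rather than J itself:

import torch

x = torch.randn(3, requires_grad=True)
y = torch.stack([x[0] * x[1], x[2] ** 2])  # non-scalar output, shape (2,)

v = torch.tensor([1.0, 2.0])  # the "vector" in the vector-Jacobian product
vjp, = torch.autograd.grad(y, x, grad_outputs=v)

# The Jacobian of y w.r.t. x, written out by hand:
# J = [[x1, x0, 0   ],
#      [0,  0,  2*x2]]
with torch.no_grad():
    J = torch.zeros(2, 3)
    J[0, 0], J[0, 1] = x[1], x[0]
    J[1, 2] = 2 * x[2]

print(torch.allclose(vjp, v @ J))  # True: grad returned v^T J, not J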

Thanks for your answer. So the vector passed in will not be mutated, but it will have an effect on the final gradients that come out of the grad function?

Is there a simple use case to illustrate why someone would need this?

In most cases, you can do without it, but for example, you can replace:

loss = l1 + 2 * l2
autograd.grad(loss, inp)

by

autograd.grad((l1, l2), inp, grad_outputs=(torch.ones_like(l1), 2 * torch.ones_like(l2)))

This is going to be slightly faster.
Also, some algorithms require you to compute x * J for some x. You can avoid having to compute the full Jacobian J by simply providing x as grad_outputs.
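For completeness, here is a runnable check of the replacement above, a minimal sketch where l1, l2, and inp are illustrative placeholders:

import torch

inp = torch.randn(4, requires_grad=True)
l1 = (inp ** 2).sum()
l2 = inp.sin().sum()

# retain_graph=True so the same graph can be reused for the second call
g1, = torch.autograd.grad(l1 + 2 * l2, inp, retain_graph=True)
g2, = torch.autograd.grad((l1, l2), inp,
                          grad_outputs=(torch.ones_like(l1), 2 * torch.ones_like(l2)))

print(torch.allclose(g1, g2))  # True: both compute dl1/dinp + 2 * dl2/dinp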

Thanks for the help. Just one more thing: it seems from the code you posted that passing in torch.ones(...) will not have a material effect on the final outcome, right? That seems to conflict with the comment about a single element, but I am not sure.

I assume above that l1 and l2 are scalar values! Sorry :smiley:
I just use ones_like() to get a Tensor with a 1 on the right device and with the right dtype.

This example was really useful for me to understand the grad_outputs argument. I think it could be added to the autograd documentation to help more people like me.

Thanks for that answer. I would add that torch.ones can be seen as the derivative of the identity map; this is how the backward differentiation is initialized. It acts as a seed, in some sense!
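A tiny sketch of that seed view (x and L here are illustrative): for a scalar output, the default grad_outputs=None behaves exactly like seeding the backward pass with a 1, i.e. dL/dL = 1:

import torch

x = torch.randn(3, requires_grad=True)
L = (x ** 3).sum()

# retain_graph=True so the graph survives for the second call
g_default, = torch.autograd.grad(L, x, retain_graph=True)
g_seeded, = torch.autograd.grad(L, x, grad_outputs=torch.ones_like(L))

print(torch.allclose(g_default, g_seeded))  # True: None seeds with dL/dL = 1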