Shape of tensors to return for backward()

I’m trying to write a PyTorch Function for a black-box function we’ll call f(x, y, z), where x, y, z are vectors of varying length and f returns a vector of length 4. I’m confused about the dimensions I should be returning from the backward function. For example’s sake, we’ll say that x is a tensor with a single dimension of length 2; then in the backward function I would return a tensor with dimensions 2x4, since the vector that f returns is of length 4. Am I correct on this? Or is there some other way that I’m supposed to calculate the gradient that ALSO results in a tensor that is a single dimension of length 2, i.e. compress the output of [f(x+dx)-f(x)]/dx into a scalar?

Hi Sus!

In short, your backward function should return a tuple of tensors that
individually have the same shapes as the tensors input to your custom
(forward) function.

Quoting from Extending torch.autograd:

backward() (or vjp()) defines the gradient formula.

It should return as many tensors as there were inputs, with each of them containing the gradient w.r.t. its corresponding input.

Just to be explicit, the “gradient w.r.t. its corresponding input” will have
the same shape and type as that input.

No. In your example case, the backward function should return a
one-dimensional tensor of length 2 (to match the input x), plus two
more gradient tensors that match y and z.

Yes. Your backward function is supposed to compute the so-called
vector-Jacobian product. Your backward function will have passed into
it a tensor (more precisely, a tuple of tensors) that is the same shape as
the output of your custom function. This is the “vector” in vector-Jacobian
product.

In your example case, this will be a one-dimensional tensor of length 4 (to
match the shape of the output of f). The Jacobian of f (with respect to its
first argument, x) will indeed be a tensor of shape 2x4, but your backward
function is supposed to contract the 2x4 Jacobian (if it even explicitly
computes the Jacobian, which it need not necessarily do) with the length-4
vector passed into it to form the length-2 vector-Jacobian product that it then
returns.
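
To make the shapes concrete, here is a minimal sketch of such a
custom Function. The f used here (which returns [sum(x), sum(y),
sum(z), sum(x*x)]) is made up purely so that its Jacobians are easy
to write down analytically – it is not your black box. The point is
only that backward() contracts the [len(x), 4] Jacobian with the
length-4 “vector” it is given and returns one gradient per input,
each matching that input’s shape.

```python
import torch

class BlackBoxF(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y, z):
        ctx.save_for_backward(x, y, z)
        # output has shape [4], regardless of the lengths of x, y, z
        return torch.stack([x.sum(), y.sum(), z.sum(), (x * x).sum()])

    @staticmethod
    def backward(ctx, grad_outputs):            # grad_outputs has shape [4]
        x, y, z = ctx.saved_tensors
        # Jacobian of f w.r.t. x in the [len(x), 4] convention used above:
        # entry [i, j] is d(output_j) / d(x_i)
        Jx = torch.stack([torch.ones_like(x),   # d(sum x)   / dx_i = 1
                          torch.zeros_like(x),  # d(sum y)   / dx_i = 0
                          torch.zeros_like(x),  # d(sum z)   / dx_i = 0
                          2.0 * x], dim=1)      # d(sum x*x) / dx_i = 2 x_i
        grad_x = Jx @ grad_outputs              # shape [len(x)], e.g. [2]
        # the Jacobians w.r.t. y and z contract to simple broadcasts
        grad_y = grad_outputs[1] * torch.ones_like(y)
        grad_z = grad_outputs[2] * torch.ones_like(z)
        return grad_x, grad_y, grad_z           # one gradient per forward input

x = torch.randn(2, requires_grad=True)
y = torch.randn(3, requires_grad=True)
z = torch.randn(5, requires_grad=True)
BlackBoxF.apply(x, y, z).sum().backward()
print(x.grad.shape, y.grad.shape, z.grad.shape)
# torch.Size([2]) torch.Size([3]) torch.Size([5])
```

(torch.autograd.gradcheck is a handy way to test a backward() like
this against numerical derivatives; use double-precision inputs.)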

Best.

K. Frank

Thank you so much for the detailed response! From what you’ve told me and what I’ve looked up online about the vector-Jacobian product, I would multiply each partial derivative (row of the 2x4?) of f by the output of f as the row vector, thus making the 2x4 into a 2x1 tensor, and then I could just reshape it to be my vjp for x?

Also, what do you mean that I don’t need to explicitly compute the Jacobian? I think that’s exactly what I’m doing and it would be great if you knew about any resources that could clear that up. Thank you so much!

Hi Sus!

As you’ve stated things, no.

The output of f (that is, the output f produced during the forward
pass) is not directly relevant. Autograd keeps track of the entire
computation graph and, in particular, of everything that happens
to the output of f during the rest of the forward pass.

In the backward pass, autograd passes into f’s backward function
a “vector” that represents the gradient of the final loss function with
respect to the output of f.

The documentation for Function.backward() calls this argument grad_outputs.

So, in your case, your Jacobian matrix would have shape [2, 4]
and grad_outputs would have shape [4]. You would multiply
grad_outputs by the Jacobian matrix to produce the return value
of backward() of shape [2]. (So I think that your result tensor will
be a one-dimensional vector without any “singleton” dimension that
you would need to .reshape() or .squeeze() away.)
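
In code, with made-up numbers just to show the shapes, that
multiplication is nothing more than:

```python
import torch

jacobian = torch.arange(8.0).reshape(2, 4)   # shape [2, 4]
grad_outputs = torch.ones(4)                 # shape [4]
grad_x = jacobian @ grad_outputs             # shape [2], already one-dimensional
print(grad_x.shape)                          # torch.Size([2])
```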

(For simplicity, I’m only talking about the case where f outputs only
a single tensor so that autograd only passes a single grad_outputs
tensor to your backward() function.)

You’re certainly allowed to compute the Jacobian and multiply it onto
grad_outputs. But backward() only has to return the result of that
multiplication. If it can construct that result by some other (possibly
cheaper) means, it’s not required to construct the Jacobian explicitly
nor explicitly perform the multiplication.

As a trivial example, suppose your custom function maps a length-n
vector to another length-n vector by multiplying it by 12.0. The
Jacobian of this function is an n x n diagonal matrix that is 12.0
times the identity matrix. backward() can multiply grad_outputs
by the scalar 12.0 and return it – no need to “materialize” an n x n
matrix nor perform a matrix-vector multiplication.
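
As a sketch, such a backward() could be as simple as:

```python
import torch

class TimesTwelve(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return 12.0 * x

    @staticmethod
    def backward(ctx, grad_outputs):
        # the vjp of a (12.0 * identity) Jacobian is just a scalar multiply –
        # no n x n matrix is ever built
        return 12.0 * grad_outputs

x = torch.randn(5, requires_grad=True)
TimesTwelve.apply(x).sum().backward()
print(x.grad)   # a length-5 tensor filled with 12.0
```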

Best.

K. Frank

The output of f (that is, the output f produced during the forward pass) is not directly relevant. Autograd keeps track of the entire computation graph and, in particular, of everything that happens to the output of f during the rest of the forward pass.

Whoops! That is exactly what I thought I was typing; I got tripped up by the grad_outputs variable name I have in my code. Thank you!

If it can construct that result by some other (possibly cheaper) means, it’s not required to construct the Jacobian explicitly nor explicitly perform the multiplication.

Oh, I see, thank you so much! It definitely clears things up for me.