How is the gradient of a row vector w.r.t. another row vector calculated via a Jacobian-vector product?

[ 1. calculate gradient via backward() ]
The following code computes the gradient of the output y of a row-vector-valued function with respect to (w.r.t.) its row-vector input x, using autograd's backward() function.

(Strictly speaking, x and y are both 1x2 matrices, which is why the Jacobian is a 1x2x1x2 tensor before it is “squeezed” in section 2 below: it is the Jacobian of one matrix w.r.t. another matrix. A quick shape check follows the code below.)

x = torch.tensor( [[2, 3]], dtype=torch.float, requires_grad=True)

def func(x):
    y = torch.zeros(1, 2) 
    y[0, 0] = x[0, 0]**2 + 3*x[0, 1] 
    y[0, 1] = x[0, 1]**2 + 2*x[0, 0]
    return y 

y = func(x)

y.backward(gradient=torch.ones_like(y))

x.grad

The output is:
tensor([[6., 9.]])
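
As a quick check of the shape claim in the parenthetical above, torch.autograd.functional.jacobian can be used to look at the raw and the squeezed Jacobian; a small sketch reusing the same func and x (values shown for x = [[2., 3.]]):

import torch

def func(x):
    y = torch.zeros(1, 2)
    y[0, 0] = x[0, 0]**2 + 3*x[0, 1]
    y[0, 1] = x[0, 1]**2 + 2*x[0, 0]
    return y

x = torch.tensor([[2, 3]], dtype=torch.float, requires_grad=True)

J_raw = torch.autograd.functional.jacobian(func, x)
print(J_raw.shape)            # torch.Size([1, 2, 1, 2]) -- Jacobian of a 1x2 matrix w.r.t. a 1x2 matrix
print(torch.squeeze(J_raw))   # the usual 2x2 Jacobian:
# tensor([[4., 3.],
#         [2., 6.]])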

[ 2. calculate gradient manually via Jacobian-vector product ]
However, I’m unable to obtain the gradient of x manually using the Jacobian-vector-product method as shown below, i.e. by matrix-multiplying the transpose of the Jacobian matrix with a vector of “ones” that has the same shape as y (i.e. a 1x2 row vector):

x = torch.tensor( [[2, 3]], dtype=torch.float, requires_grad=True)

def func(x):
  y = torch.zeros(1, 2)
  y[0, 0] = x[0, 0]**2 + 3*x[0, 1]
  y[0, 1] = x[0, 1]**2 + 2*x[0, 0]
  return y

y = func(x)

J = torch.squeeze(torch.autograd.functional.jacobian(func, x))

x_grad = torch.matmul(
  torch.transpose(J, 0, 1) , 
  torch.ones_like(y)
)

x_grad

This is because the transposed 2x2 Jacobian matrix of y w.r.t. x cannot be matrix-multiplied with a 1x2 row vector of ones that has the same shape as y, as the error message indicates, which is understandable.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-33-3ebd031d82e3> in <module>
     11 J = torch.squeeze(torch.autograd.functional.jacobian(func, x))
     12 
---> 13 x_grad = torch.matmul(
     14     torch.transpose(J, 0, 1),
     15     torch.ones_like(y)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x2 and 1x2)
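
For reference, torch.autograd.functional.vjp computes exactly this product of the transposed Jacobian with a vector in the shape of y (PyTorch calls it a vector-Jacobian product) and handles the shape bookkeeping internally; a minimal sketch, assuming the same func as above:

import torch

def func(x):
    y = torch.zeros(1, 2)
    y[0, 0] = x[0, 0]**2 + 3*x[0, 1]
    y[0, 1] = x[0, 1]**2 + 2*x[0, 0]
    return y

x = torch.tensor([[2, 3]], dtype=torch.float, requires_grad=True)

# v must have the same shape as func's output; the result has the shape of x
y, x_grad = torch.autograd.functional.vjp(func, x, v=torch.ones(1, 2))
print(x_grad)   # tensor([[6., 9.]])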

[ 3. My Question ]
So, if the gradient is indeed calculated by autograd using the Jacobian-vector-product method, does this mean that internally autograd converts (transposes) the input row vector of ones into a column vector of the same length so that the matrix multiplication can be carried out correctly, and then transposes the resulting column vector back so that the final output is a row vector with the same shape as x, as shown below?

x = torch.tensor( [[2, 3]], dtype=torch.float, requires_grad=True)

def func(x):
    y = torch.zeros(1, 2) 
    y[0, 0] = x[0, 0]**2 + 3*x[0, 1] 
    y[0, 1] = x[0, 1]**2 + 2*x[0, 0]
    return y 

y = func(x)

J = torch.squeeze(torch.autograd.functional.jacobian(func, x))

x_grad = torch.transpose(torch.matmul(
    torch.transpose(J, 0, 1),
    torch.transpose(torch.ones_like(y), 0, 1) # transpose the row vector of ones to a column vector
), 0, 1)  # the result is transposed back to a row vector in the same shape as x

x_grad

… which does generate the correct result:
tensor([[6., 9.]])

While the shapes and maths work out in your case (so the answer to your question could be just “yes”), the general answer is a bit more elaborate:
As you note, autograd uses backpropagation to compute the Jacobian-vector product. However, if you want to interpret this literally as a matrix-vector product of the Jacobian with the vector, PyTorch - as is usual for systems implementing this - does not follow the “mathematical shapes” (vector and matrix) literally.
In backpropagation, we compute the JVP of a composition g ∘ f at an input x as D(g ∘ f) v = Df (Dg v), where D here denotes the backward map, i.e. multiplication by the transposed Jacobian. Let us say that y = f(x) and z = g(y).
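
A small sketch of these two backward steps in code, using hypothetical functions f and g (not the func from the question); the shape convention it relies on is spelled out in the next paragraph:

import torch

def f(x):
    return x**2 + x    # y has the same shape as x: (1, 2)

def g(y):
    return 3*y         # z has the same shape as y: (1, 2)

x = torch.tensor([[2., 3.]], requires_grad=True)
y = f(x)
z = g(y)

v = torch.ones_like(z)    # the "vector", stored in the shape of z

# one-shot backward through the composition g o f
gx_direct, = torch.autograd.grad(z, x, grad_outputs=v, retain_graph=True)

# two steps: first (Dg v), in the shape of y, then Df(Dg v), in the shape of x
gy, = torch.autograd.grad(z, y, grad_outputs=v, retain_graph=True)
gx_chained, = torch.autograd.grad(y, x, grad_outputs=gy)

print(torch.allclose(gx_direct, gx_chained))   # True
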
Now v would always be a column vector here, as would (Dg v). In PyTorch, however, the convention is that this conceptual column vector is represented by rearranging its entries into the shape of the (intermediate) output, so v is represented as a tensor of the same shape as z and (Dg v) has the same shape as y. The matrices themselves are typically not explicit either (with matrix multiplication being the big exception), because that would be very inefficient:
Consider a function f multiplying a 3 x 3 x 3 tensor by two. The output is again 3 x 3 x 3, so the vector v in the JVP Df v would have 27 entries. So would the JVP itself, and the matrix Df would be a 27 x 27 matrix with 2 on the diagonal and zeros elsewhere. But of course, autograd does not compute it by building that matrix; it just multiplies v by 2 to apply Df and get (Df v) = 2v.
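
A sketch making this concrete, building the explicit 27 x 27 matrix only for illustration:

import torch

def f(t):
    return 2 * t

t = torch.randn(3, 3, 3, requires_grad=True)
v = torch.randn(3, 3, 3)    # the "vector", stored in the shape of the output

# materialized Jacobian: 27 x 27 with 2 on the diagonal and zeros elsewhere
J = torch.autograd.functional.jacobian(f, t).reshape(27, 27)
print(torch.allclose(J, 2 * torch.eye(27)))        # True

# what autograd actually does: apply the backward of "multiply by two" to v
vjp_matrix = (J.t() @ v.reshape(27)).reshape(3, 3, 3)
vjp_autograd, = torch.autograd.grad(f(t), t, grad_outputs=v)
print(torch.allclose(vjp_matrix, vjp_autograd))    # True
print(torch.allclose(vjp_autograd, 2 * v))         # True: no 27 x 27 matrix needed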

If we were maths-braggards, we might alternatively describe this as operating not on vectors but on elements of the dual spaces of the domain and image space (as finite-dimensional vector spaces over the reals), which we identify with the primal spaces, and the Jacobian D as a contravariant functor mapping each function f : X → Y to Df : D(Y)=Y → D(X)=X (note that because we operate on the dual spaces, we pull back the dual-space element using the more conventional linear map between the tangent spaces from, say, functional analysis). Then there would be no vectors at all, just congratulations on the clever use of category theory in programming. But in the end, while it might be good to know that there is some fancier abstract machinery in the background rather than everything being a fragile ad-hoc construction, it doesn’t add that much value here, and we are not maths-braggards, are we?

Best regards

Thomas