Non-Scalar Gradient

simon-schaefer · September 3, 2021, 10:13am

I want to efficiently compute the gradient of a tensor A of shape (batch_size, M) w.r.t. B of shape (batch_size, L), i.e. the Jacobian dA/dB of shape (batch_size, M, L). Since torch.autograd.grad only accepts scalar outputs by default, the obvious option is to iterate over each element of A and compute the gradient of a_ij w.r.t. B, which seems quite inefficient. Also, for my use case, finding an analytic solution seems to be hard.

The simplest case of the problem would be something like

import torch

B = torch.rand((4, 10), requires_grad=True)
W = torch.rand((20, 10))
A = torch.matmul(B, W.T)

torch.autograd.grad(A, B)  # RuntimeError: grad can be implicitly created only for scalar outputs

Is there a way to compute dA/dB without iteration?
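For reference, the iterative baseline described above could be sketched roughly as follows. This is only an illustration of the slow approach, not a recommended solution; the names `jac`, `batch_size`, `M` and `L` are introduced here for the example, and it assumes that sample i of A depends only on row i of B (true for the matmul toy case).

```python
import torch

torch.manual_seed(0)
B = torch.rand((4, 10), requires_grad=True)
W = torch.rand((20, 10))
A = torch.matmul(B, W.T)  # shape (4, 20)

batch_size, M = A.shape
L = B.shape[1]
jac = torch.zeros(batch_size, M, L)

for i in range(batch_size):
    for j in range(M):
        # Gradient of the scalar A[i, j] w.r.t. all of B, shape (4, 10).
        # retain_graph=True keeps the graph alive for the next iteration.
        (g,) = torch.autograd.grad(A[i, j], B, retain_graph=True)
        # Assumption: A[i, :] only depends on B[i, :], so keep that row only.
        jac[i, j] = g[i]

print(jac.shape)
```

This needs batch_size * M backward passes (80 here), which is what makes it unattractive for larger tensors. In this toy case every jac[i] simply equals W.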

gphilip · September 3, 2021, 10:20am

This should do the trick, I think (I'm still learning the ropes around autograd):

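import torch

B = torch.rand((4, 10), requires_grad=True)
W = torch.rand((20, 10))
A = torch.matmul(B, W.T)

# Backpropagate with an all-ones upstream gradient of the same shape as A.
A.backward(gradient=torch.ones_like(A))
B.grad  # accumulated gradient, same shape as B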

simon-schaefer · September 3, 2021, 10:26am

Thank you for your quick response!

I think A.backward(gradient=torch.ones_like(A)) is the same as A.sum().backward(). So you don’t compute the derivative of every element of A w.r.t. every element of B, but the derivative of the sum over all elements of A w.r.t. B. As a result, you get a gradient of shape (4, 10), not the Jacobian of shape (4, 20, 10).
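A quick check of this claim, as a minimal sketch: the names `B1`, `B2`, `same` and `grad_shape` are made up for the comparison, and the second leaf tensor is just a detached copy so that both variants start from identical values.

```python
import torch

torch.manual_seed(0)
W = torch.rand((20, 10))

# Variant 1: backward with an all-ones upstream gradient.
B1 = torch.rand((4, 10), requires_grad=True)
A1 = torch.matmul(B1, W.T)
A1.backward(gradient=torch.ones_like(A1))

# Variant 2: reduce A to a scalar first, then call backward.
B2 = B1.detach().clone().requires_grad_(True)
A2 = torch.matmul(B2, W.T)
A2.sum().backward()

same = torch.allclose(B1.grad, B2.grad)
grad_shape = tuple(B1.grad.shape)
print(same, grad_shape)
```

In this toy example each row of the resulting gradient is the column-wise sum of W, i.e. the Jacobian entries summed over the M output dimension, which is why the (4, 20, 10) structure is lost.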

gphilip · September 3, 2021, 10:30am

This is something I don’t understand well enough to address, then! Let us wait for someone more knowledgeable to take a look.