I want to compute the gradient of a tensor A of shape (batch_size, M) w.r.t. a tensor B of shape (batch_size, L) efficiently, i.e. the Jacobian dA/dB of shape (batch_size, M, L). Since `torch.autograd.grad` effectively handles only scalar outputs (non-scalar outputs require a `grad_outputs` weighting that collapses them to a scalar), iterating over each element a_ij of A and computing its gradient w.r.t. B seems quite inefficient. For my use case, finding an analytic solution also seems hard.

The simplest case of the problem would be something like

``````import torch

B = torch.rand((4, 10), requires_grad=True)
W = torch.rand((20, 10))
A = torch.matmul(B, W.T)  # shape (4, 20)
``````

Is there a way to compute dA/dB without iteration?

This should do the trick, I think (still learning my ropes around autograd):

``````B = torch.rand((4, 10), requires_grad=True)
W = torch.rand((20, 10))
A = torch.matmul(B, W.T)

# backpropagate a tensor of ones through A
A.backward(gradient=torch.ones_like(A))
print(B.grad.shape)  # torch.Size([4, 10])
``````
I think `A.backward(gradient=torch.ones_like(A))` is the same as `A.sum().backward()`. So you don't compute the derivative of every element of A w.r.t. every element of B, but the derivative of the sum of A w.r.t. B. Consequently, you get a gradient of shape (4, 10), not the full Jacobian of shape (4, 20, 10).