How to compute derivative of matrix output with respect to matrix input most efficiently?

I have the following situation. An input has shape [BATCH_SIZE, DIMENSIONALITY] and an output has shape [BATCH_SIZE, CLASSES]. The gradient of the output with respect to the input should have shape [BATCH_SIZE, CLASSES, DIMENSIONALITY] (or similar). Currently, I use the following function to get the gradient:

def compute_grad(input, model):
    grad = []
    output = model(input)
    # one backward pass per output column (class)
    for i in range(output.size(1)):
        grad.append(torch.autograd.grad(
            outputs=output[:, i],
            inputs=input,
            grad_outputs=torch.ones_like(output[:, i]),
            retain_graph=True,
        )[0])
    return torch.stack(grad)

Is there a way to get the gradient without using a for loop? Could this function be written better (more elegantly and efficiently)?


Currently you are running the backward pass once for each output, but this can be parallelized with large batches on the GPU.
If you run the forward and then the backward on your model, i.e.,

out = model.forward(input)
loss = criterion(out, class_labels)
loss.backward()

Of course, you should first pass the model parameters to the optimizer, and the criterion here is cross-entropy. Then you can simply extract the grads from the model:

for param in model.parameters():
    print(param.grad)

The autograd tools can only do Jacobian-vector products. So if you want the full Jacobian, you will need to run as many backward passes as the output size.
Unfortunately, there is no way to make this more efficient in the general case at the moment.
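A minimal sketch of this approach (the linear model and sizes here are just illustrative assumptions, not from the thread): one backward pass per output column, each yielding one [BATCH_SIZE, DIMENSIONALITY] slice of the Jacobian.

```python
import torch

# Illustrative assumption: a small linear model with known sizes.
batch, dim, classes = 4, 3, 2
model = torch.nn.Linear(dim, classes)
x = torch.randn(batch, dim, requires_grad=True)
y = model(x)  # shape [batch, classes]

rows = []
for i in range(classes):
    g = torch.zeros_like(y)
    g[:, i] = 1.0  # select output column i for every batch element
    rows.append(torch.autograd.grad(y, x, grad_outputs=g, retain_graph=True)[0])
jac = torch.stack(rows, dim=1)  # shape [batch, classes, dim]
```

Note that this needs CLASSES backward passes rather than BATCH_SIZE × CLASSES, since each grad_outputs column selects that output for all batch elements at once.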


Thank you for the answer.

If the autograd tools can only do Jacobian-vector products, then, in my opinion, it’s quite confusing that you are able to specify a matrix of shape [n, m] for the grad_outputs parameter when the output is a matrix. For example, in the above scenario, if I do

grad = torch.autograd.grad(outputs=output, inputs=input,
                           grad_outputs=torch.ones_like(output))[0]

then grad now has shape [BATCH_SIZE, DIMENSIONALITY].
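A small sketch of this behaviour (again assuming an illustrative linear model): passing a matrix of ones as grad_outputs sums the per-class gradients, so the result comes back in the input’s shape.

```python
import torch

batch, dim, classes = 4, 3, 2
model = torch.nn.Linear(dim, classes)  # illustrative assumption
x = torch.randn(batch, dim, requires_grad=True)
y = model(x)

# grad_outputs must match the output's shape; all-ones sums over classes.
grad = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))[0]
print(grad.shape)  # torch.Size([4, 3]) == [BATCH_SIZE, DIMENSIONALITY]
```

For a linear layer this gives, per batch element, the column sums of the weight matrix, i.e. the per-class gradients added together rather than kept separate.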

The thing is that, for the Jacobian, your output is always considered as 1D (and similarly for the input).
That way you can see the Jacobian as a big 2D matrix.
And the vector you need to give for the dot product should be as big as the output itself.

Does it mean that the input and the output are flattened?

No, they are not. This is just the reason why the shape of grad_outputs doesn’t matter, as long as it is the same as the output’s.

Okay, but you say that the input is considered as 1D even though in my case the input is 2D. How is that resolved?

There are two things here.
Your net takes an ND tensor as input and outputs an ND tensor.
PyTorch asks for an ND tensor as grad_outputs and will return an ND tensor for the grad_inputs.

Now if you look at this mathematically, studying the function defined by your network, you usually consider that it takes a 1D tensor and outputs a 1D tensor.
This function’s Jacobian will be a 2D matrix of size nb_out x nb_in.
When I say that PyTorch performs a Jacobian-vector product, it is based on this mathematical formulation, where the Jacobian is a 2D matrix and the vector is a vector of size nb_out.

That being said, these mathematical objects are never actually created; PyTorch works only with the ND tensors you give it. But it will return the result corresponding to these mathematical operations (though not in 1D, but with the ND shapes you gave it initially).
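A small check of this correspondence (the toy function here is a hypothetical example): the vector-Jacobian product computed on ND tensors matches the one computed by flattening everything to 1D and building the 2D Jacobian explicitly.

```python
import torch

x = torch.randn(2, 3, requires_grad=True)
y = (x * 2).sum(dim=0)  # toy function; y has shape [3]
v = torch.randn(3)      # vector "as big as the output"

# ND version: PyTorch accepts and returns tensors in their original shapes.
grad_nd = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)[0]

# 1D mathematical view: build the nb_out x nb_in Jacobian explicitly
# and do the vector-Jacobian product by hand.
jac = torch.zeros(3, 6)
for i in range(3):
    jac[i] = torch.autograd.grad(y[i], x, retain_graph=True)[0].flatten()
grad_1d = v @ jac  # vector of size nb_in

# grad_nd reshaped to 1D equals the hand-computed product.
```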

Is that a clearer explanation?

Yeah, I understand that. I am asking all these questions because I want to understand mathematically what torch.autograd.grad(y, x, v) returns if y is a 1D tensor, x is an ND tensor, and v is a 1D tensor.

So if we index y and grad_y with i, and x and grad_x with j and k (assuming x is 2D, but you can add more dimensions if you want), then doing grad_x = torch.autograd.grad(y, x, grad_y) gives:

grad_x[j, k] = sum_i grad_y[i] * dy[i]/dx[j, k]
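The formula above can be verified numerically on a small hypothetical function where the partial derivatives are known in closed form:

```python
import torch

x = torch.randn(2, 3, requires_grad=True)
y = (x ** 2).sum(dim=1)  # toy function: y[i] = sum_k x[i, k]**2, y is 1D
v = torch.randn(2)       # plays the role of grad_y

grad_x = torch.autograd.grad(y, x, grad_outputs=v)[0]

# Here dy[i]/dx[j, k] = 2 * x[j, k] if i == j, else 0, so the formula gives
# grad_x[j, k] = sum_i v[i] * dy[i]/dx[j, k] = v[j] * 2 * x[j, k]
expected = v.unsqueeze(1) * 2 * x
```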


Thank you very much! 🙂