Need for a crystal-clear explanation of autograd.grad


I find the docs and the forum topics around autograd.grad very unclear, especially concerning the Jacobian-vector product.

Here is what I struggle with:

I need to compute the Jacobian-vector product vJ for one layer of neural network.

Say f(x, w) is a parametric function from R^(nx2) x R^2 to R^n (n being the batch size, R^2 the parameter space, or hypothesis space as it's sometimes called among computer scientists), and v is in R^2. I want to compute v^T J_f(x, w), where x is in R^(nx2) and w is in R^2.

From what I understood from the docs, it suffices to call autograd.grad(f(x, w), w, grad_outputs=v).

But I get an error:
Mismatch in shape: grad_output[0] has a shape of torch.Size([1, 2]) and output[0] has a shape of torch.Size([n, 1]).

-> How can the output shape be relevant here? I want to multiply the Jacobian by v, and the shape of the Jacobian is (n, 2).

-> In addition, the docs should state clearly what operation is applied to v when the argument grad_outputs=v is given to autograd.grad.

Here is the code to reproduce the error:

import torch
import torch.nn as nn

class example_net(nn.Module):
    def __init__(self):
        super(example_net, self).__init__()
        self.linear1 = nn.Linear(2, 1, bias=False)

    def forward(self, x):
        x = self.linear1(x)
        x = torch.sigmoid(x)
        return x.squeeze(-1)  # shape (batch,)

criterion = nn.BCELoss(reduction = 'none')

exnet = example_net()

x = torch.randn(100, 2)
w = exnet.linear1.weight
y = torch.randn(100)
u = torch.randn(2)
losses = criterion(exnet(x), y)

torch.autograd.grad(losses, w, grad_outputs=u)

RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([2]) and output[0] has a shape of torch.Size([100]).

[EDIT] Corrected the output dimension in the definitions.


To clear up the notation when working with Jacobians, it is simpler to see your function as taking a single 1D input and returning a single 1D output. This way the Jacobian will be a 2D matrix of shape [nb_out, nb_in].
In your case, you can consider a function g that takes an input of size 2*n + 2 and returns an output of size 2.
The Jacobian for that function will have size [2, 2*n + 2].
So to be able to do a vector jacobian product, you need to provide a vector v of size 2.

Note that in pytorch, we relax this constraint by allowing multiple inputs/outputs and allowing them to be of higher dimension.
But the root idea remains: v should be the same size as the output.
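A minimal sketch of that constraint (the layer below is a stand-in I wrote for the one in your code, not your exact network):

```python
import torch

n = 100
x = torch.randn(n, 2)
w = torch.randn(1, 2, requires_grad=True)   # stand-in for the layer's weights
out = torch.sigmoid(x @ w.t()).squeeze(-1)  # output has size n = 100

v = torch.randn(n)                          # v must match the OUTPUT size
(vjp,) = torch.autograd.grad(out, w, grad_outputs=v)
print(vjp.shape)                            # the result matches w: torch.Size([1, 2])
```

Passing a v of size 2 (the parameter size) instead raises exactly the shape-mismatch error from the original post.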

In your code sample, because you set reduction = 'none' for the criterion, no reduction happens, so your loss (as mentioned in the error message) is of size 100 (the batch size). I think you should double-check the definition of BCELoss: the reduction happens over the batch dimension, and you never get a score per class, so you cannot get an output of size 2 from it.
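For instance, a quick illustration of the two reduction modes and the shapes they produce:

```python
import torch
import torch.nn as nn

pred = torch.sigmoid(torch.randn(100))        # probabilities in (0, 1)
target = torch.randint(0, 2, (100,)).float()  # binary targets

loss_none = nn.BCELoss(reduction='none')(pred, target)
loss_mean = nn.BCELoss(reduction='mean')(pred, target)
print(loss_none.shape)  # torch.Size([100]) -- one loss per sample
print(loss_mean.shape)  # torch.Size([])    -- reduced over the batch
```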

Hi Alban,

Thank you for the fast response.

I think that you didn’t answer the question. And I see now that I’ve made a mistake when asking. I’ll edit that. The correct definition is the following:
The function f maps R^(nx2)xR^2 into R^n. And we differentiate only w.r.t the parameters (which are in R^2).

First, the Jacobian matrix here is in fact of size [nb_out, nb_parameters], as I am differentiating w.r.t. the weights of the only layer. So it should be of size [100, 2]. I am sure of that, as I am able to derive it on paper as well as compute it without autograd. Since v is in R^2, there is, in theory, no problem doing the operation Jv.

So what I want is to retrieve the Jacobian-vector product Jv. But there is a check that the shapes of v and y (the output) match, and that makes no sense here.
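For reference, the shape claim can be checked numerically by building the Jacobian row by row, one backward pass per output element (the snippet below uses a stand-in for my layer; names are illustrative):

```python
import torch

n = 100
x = torch.randn(n, 2)
w = torch.randn(1, 2, requires_grad=True)   # the layer's weights
out = torch.sigmoid(x @ w.t()).squeeze(-1)  # output in R^100

# Build the full Jacobian d(out)/d(w), one row per output element.
rows = [torch.autograd.grad(out[i], w, retain_graph=True)[0].reshape(-1)
        for i in range(n)]
J = torch.stack(rows)                       # shape [100, 2]

v = torch.randn(2)
Jv = J @ v                                  # Jv is well-defined, shape [100]
print(J.shape, Jv.shape)
```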

Concerning BCELoss, I've read the definition, and reduction = 'none' is what I need, since I need to play with reweighting the expected value of the likelihood (binary cross-entropy if you prefer), but those details are irrelevant here.

Thank you in advance; being able to do that Jacobian-vector product would make my research much easier.


I think the misconception here is between vector-Jacobian product vs Jacobian-vector product.
Reverse-mode AD, which all DL frameworks use, computes vector-Jacobian products, where v is the same size as the output.

If you want to do a Jacobian-vector product (and so v is the same size as the input), you need a different trick. If you use pytorch nightly, we added functions to do this here.
Otherwise, you can use something based on this old gist:
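The gist itself isn't reproduced in this thread, but here is a sketch of the usual "double backward" trick it relies on (my own reconstruction, not the gist's exact code; recent PyTorch also ships torch.autograd.functional.jvp, which I believe is based on the same idea):

```python
import torch

def jvp(y, x, v):
    # Jacobian-vector product J @ v computed from two vector-Jacobian
    # products (the "double backward" trick).
    u = torch.ones_like(y, requires_grad=True)  # dummy vector, same size as y
    (g,) = torch.autograd.grad(y, x, grad_outputs=u, create_graph=True)
    # g(u) = u^T J is linear in u; differentiating g w.r.t. u with
    # grad_outputs=v therefore yields J @ v.
    (Jv,) = torch.autograd.grad(g, u, grad_outputs=v)
    return Jv

# Setup mirroring the thread: output in R^100, parameters in R^2.
x = torch.randn(100, 2)
w = torch.randn(1, 2, requires_grad=True)
out = torch.sigmoid(x @ w.t()).squeeze(-1)

v = torch.randn(1, 2)        # same shape as the input w
Jv = jvp(out, w, v)
print(Jv.shape)              # same shape as the output: torch.Size([100])
```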

This is perfect thank you !