How to Penalize Norm of End-to-End Jacobian

I want to train a network using a modified loss function that has both a typical classification loss (e.g. nn.CrossEntropyLoss) as well as a penalty on the Frobenius norm of the end-to-end Jacobian (i.e. if f(x) is the output of the network, \nabla_x f(x)).

I’ve implemented a model that can successfully learn using nn.CrossEntropyLoss. However, when I try adding the second loss function (by doing two backwards passes), my training loop runs, but the model never learns. Furthermore, if I calculate the end-to-end Jacobian, but don’t include it in the loss function, the model also never learns. At a high level, my code does the following:

  1. Forward pass to get predicted classes, yhat, from inputs x
  2. Call yhat.backward(torch.ones(appropriate shape), retain_graph=True)
  3. Jacobian norm =
  4. Set loss equal to classification loss + scalar coefficient * jacobian norm
  5. Run loss.backward()

I suspect that I’m misunderstanding how backward() works when run twice, but I haven’t been able to find any good resources to clarify this.

Too much is required to produce a working example, so I’ve tried to extract the relevant code:

def train_model(model, train_dataloader, optimizer, loss_fn, device=None):

    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    train_loss = 0
    correct = 0
    for batch_idx, (batch_input, batch_target) in enumerate(train_dataloader):
        batch_input, batch_target =,
        model_batch_output = model(batch_input)
        loss = loss_fn(model_output=model_batch_output, model_input=batch_input, model=model, target=batch_target)
        train_loss += loss.item()  # sum up batch loss
    def end_to_end_jacobian_loss(model_output, model_input):
        jacobian =
        jacobian_norm = jacobian.norm(2)
        return jacobian_norm

I swapped my previous .backward() implementation for autograd.grad and it appears to work! What’s the difference?

    def end_to_end_jacobian_loss(model_output, model_input):
        jacobian = autograd.grad(
        jacobian_norm = jacobian.norm(2)
        return jacobian_norm


You should never need .data anymore. Also keep in mind that .data breaks the computational graph, so no gradients will flow back through it.

Here you want to compute the gradient in such a way that you can backward through the gradient computation. This is done with the create_graph=True flag to autograd.grad.
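As a rough illustration of the create_graph=True point, here is a minimal sketch of one training step. The toy model, the random data and the 0.1 coefficient are all made up for the example, and the penalty here is the norm of a single vector-Jacobian product (with a vector of ones), not the norm of the full Jacobian.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and data, only for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))
x = torch.randn(5, 4, requires_grad=True)   # inputs must require grad
target = torch.randint(0, 3, (5,))

logits = model(x)
class_loss = nn.functional.cross_entropy(logits, target)

# Vector-Jacobian product with a vector of ones. create_graph=True records
# the gradient computation itself, so the penalty can be backpropagated.
(input_grad,) = torch.autograd.grad(
    outputs=logits,
    inputs=x,
    grad_outputs=torch.ones_like(logits),
    create_graph=True,
)
penalty = input_grad.norm(2)

loss = class_loss + 0.1 * penalty   # 0.1 is an arbitrary coefficient
loss.backward()

# The parameters now hold gradients from both loss terms.
first_weight_grad = model[0].weight.grad
```

Without create_graph=True, input_grad would be a plain tensor with no history, and the penalty term would contribute nothing to the parameter gradients.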

Also, since you pass a Tensor of ones as grad_outputs, you get the sums of the columns of your Jacobian, because autograd only computes a vector-Jacobian product.
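A small check of the column-sum claim, assuming a linear map f(x) = W x (chosen only because its Jacobian is W itself, so the result is easy to verify):

```python
import torch

torch.manual_seed(0)

W = torch.randn(3, 4)                   # Jacobian of f(x) = W @ x is W itself
x = torch.randn(4, requires_grad=True)
y = W @ x

# grad_outputs plays the role of the vector v in the product v^T J.
(vjp_ones,) = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))

column_sums = W.sum(dim=0)              # entry j is the sum of column j of J
matches = torch.allclose(vjp_ones, column_sums)
```

Here matches should come out True: the result has one entry per input dimension, and each entry adds up one column of the Jacobian.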


How do I obtain the complete matrix Jacobian?

One can compute the full Jacobian matrix by running backward multiple times through the graph. For example, given a function f that maps R^n to R^m, we can compute the Jacobian as follows:

import torch

def unit_vectors(length):
    # Returns the standard basis vectors e_0, ..., e_{length-1}.
    result = []
    for i in range(0, length):
        x = torch.zeros(length)
        x[i] = 1
        result.append(x)
    return result

x = torch.randn(3, requires_grad=True)
y = f(x)
result = [torch.autograd.grad(outputs=[y], inputs=[x], grad_outputs=[unit], retain_graph=True)[0] for unit in unit_vectors(y.size(0))]
jacobian = torch.stack(result, dim=0)
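For a concrete check, here is a self-contained sketch of the same row-by-row approach on a made-up f from R^3 to R^2, compared against torch.autograd.functional.jacobian. That helper is assumed to be available, which should hold for reasonably recent PyTorch releases; jacobian_by_rows is a hypothetical name used only here.

```python
import torch

def f(x):
    # Hypothetical map from R^3 to R^2, chosen only for illustration.
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

def jacobian_by_rows(func, x):
    # One autograd.grad call per output component; each call yields one row.
    y = func(x)
    rows = []
    for i in range(y.size(0)):
        unit = torch.zeros_like(y)
        unit[i] = 1.0
        (row,) = torch.autograd.grad(y, x, grad_outputs=unit, retain_graph=True)
        rows.append(row)
    return torch.stack(rows, dim=0)

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
J = jacobian_by_rows(f, x)                          # shape (2, 3)
J_ref = torch.autograd.functional.jacobian(f, x)    # reference result
```

To use a Jacobian computed this way inside a loss, the autograd.grad calls would also need create_graph=True, as discussed earlier in the thread.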

Following up from the questions in,

@zou3519 also, where do you use the unit_vectors() function?

unit_vectors is used in the second-to-last line.

Why are multiple grad calls necessary? Shouldn’t one be sufficient?

Given a scalar output f and an input vector x, autograd gives the vector (\frac{df}{dx_1}, \frac{df}{dx_2}, ..., \frac{df}{dx_n}).

For the full matrix Jacobian we have an output (f_1, f_2, ..., f_m) and we want all \frac{df_i}{dx_j}. One way to get this is to run autograd once on each f_i, which is what is being done in the example above.

There might be a way to compress all of this computation into one backward pass; were you thinking of something specific?


were you thinking of something specific?

Yes, I was hoping to penalize the norm of the entire Jacobian, computed in a single backwards pass. I can’t really rely on a solution that scales with the output dimension of the network. Is there no other way?

Why doesn’t autograd compute the full Jacobian ordinarily? Maybe I need to learn more about forward and backward auto-differentiation.

Reverse-mode autograd (what we have in PyTorch) computes vector-Jacobian products. That is, given a function f, an input x, and an arbitrary vector v, autograd can tell you v J, where J is the Jacobian of f with respect to x.

With multiple vector-Jacobian products (querying autograd for u1 J, u2 J, u3 J, etc., where each ui is a unit vector), we can reconstruct the full Jacobian, which is what the code I wrote above does.

Touching back on albanD’s point, if I pass a vector of all ones instead of a single one, I receive the sum of all the columns of the Jacobian matrix?

I now understand. I thought “vector-Jacobian products” meant “we can compute a Jacobian” and “since the Jacobian is a matrix, we can multiply it with vectors.” But this isn’t the case; the two can’t be computed separately currently.


Yes, you’d get a sum of all of the columns as @albanD wrote in his reply.

Ok thank you. Can I ask what the status on forward-mode automatic differentiation in Pytorch is?

I’m not sure what the status of it is, but as with all PyTorch features, if a lot of users want it, the dev team takes that into consideration.

Do you have a specific use case for forward mode AD?

I’m not sure what constitutes a specific use case, but this is for a research project. What would you like to know about it?

Oh I am curious how forward-mode AD would enable your research project. Forward mode AD computes Jacobian vector products (as opposed to vector Jacobian products that are computed with reverse-mode AD).
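To make the distinction concrete, here is a small sketch using the helpers in torch.autograd.functional, again on a linear map whose Jacobian is a known matrix W. As far as I understand, the jvp helper there is implemented with a double-backward trick rather than true forward-mode AD, so treat it as an illustration of what each product returns, not of how forward mode would perform.

```python
import torch
from torch.autograd.functional import jvp, vjp

W = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])     # Jacobian of f is W (2 x 3)

def f(x):
    return W @ x

x = torch.zeros(3)
u = torch.tensor([1.0, 0.0])            # lives in output space (R^2)
v = torch.tensor([0.0, 1.0, 0.0])       # lives in input space (R^3)

_, row = vjp(f, x, u)     # u^T J: picks out row 0 of J
_, col = jvp(f, x, v)     # J v:   picks out column 1 of J
```

A vector-Jacobian product with a unit vector yields one row of the Jacobian (one output component), while a Jacobian-vector product with a unit vector yields one column (one input component).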

Maybe forward-mode isn’t what I’m looking for. What mode does jax use to compute the full Jacobian?

I am not sure how jax computes the full Jacobian, but I have always been curious about that. I might do some digging later.

To summarize our discussion above, it sounds like you just want to compute the full Jacobian and use it as a penalty. I posted some code above that computes a Jacobian by invoking autograd.grad multiple times; why doesn’t that work with your use case?

why doesn’t that work with your use case?

I have about 10 different architectures with output dimensions of order 10^3, so running 10 * 1000 backward passes per gradient step is probably far too slow. Unless maybe you think otherwise?

I am not sure how jax computes the full Jacobian, but I have always been curious about that. I might do some digging later.

If you do find out, please let me know :slight_smile:

Also, I appreciate you taking the time to help me.

Yeah, that sounds pretty bad. I found this tracking issue for your particular problem (computing a Jacobian efficiently): so you can subscribe to that. Unfortunately I don’t have any suggestions for easy ways to speed your case up :confused: