I want to train a network using a modified loss function that has both a typical classification loss (e.g. nn.CrossEntropyLoss) as well as a penalty on the Frobenius norm of the end-to-end Jacobian (i.e. if f(x) is the output of the network, \nabla_x f(x)).

I’ve implemented a model that can successfully learn using nn.CrossEntropyLoss. However, when I try adding the second loss function (by doing two backwards passes), my training loop runs, but the model never learns. Furthermore, if I calculate the end-to-end Jacobian, but don’t include it in the loss function, the model also never learns. At a high level, my code does the following:

Forward pass to get predicted classes, yhat, from inputs x

You should never need .data anymore. Also keep in mind that .data breaks the computational graph, so no gradients will flow back through it.

Here you want to compute the gradient in such a way that you can backward through the gradient computation. This is done with the create_graph=True flag to autograd.grad.
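As a minimal sketch of the create_graph=True pattern (the function and shapes here are made up for illustration, not your actual network):

```python
import torch

x = torch.randn(5, requires_grad=True)
y = (x ** 2).sum()  # stand-in for a network output

# create_graph=True records the gradient computation itself,
# so the penalty below is differentiable w.r.t. x.
g, = torch.autograd.grad(y, x, create_graph=True)  # g = 2x

penalty = g.pow(2).sum()  # ||grad||^2 = sum(4 x^2)
penalty.backward()        # d penalty / dx = 8x
```

In a real training loop you would add `penalty` to the classification loss before calling backward, so a single backward pass updates the parameters with both terms.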

Also, since you pass a Tensor of ones as grad_outputs, you get the sums of the columns of your Jacobian, because autograd only computes a vector-Jacobian product.
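For instance, with a toy function (made up here) whose Jacobian is easy to write down, passing ones as grad_outputs returns the column sums rather than the full matrix:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.stack([x[0] * x[1], x[0] + x[1]])  # Jacobian: [[x1, x0], [1, 1]]

# ones^T J sums each column of the Jacobian:
g, = torch.autograd.grad(y, x, grad_outputs=torch.ones(2))
# g = [x1 + 1, x0 + 1] = [4., 3.]
```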

One can compute the full Jacobian matrix by backwarding through the graph multiple times. For example, given an f that maps R^n to R^m, we can compute the Jacobian row by row:

import torch

def unit_vectors(length):
    result = []
    for i in range(length):
        x = torch.zeros(length)
        x[i] = 1
        result.append(x)
    return result

x = torch.randn(3, requires_grad=True)
y = f(x)  # f: R^3 -> R^m
result = [torch.autograd.grad(outputs=[y], inputs=[x], grad_outputs=[unit], retain_graph=True)[0]
          for unit in unit_vectors(y.size(0))]
jacobian = torch.stack(result, dim=0)
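To sanity-check the idea, here is a self-contained variant applied to a linear map, whose Jacobian is just its weight matrix; note it swaps retain_graph=True for create_graph=True so the result stays differentiable and could be used as a penalty:

```python
import torch

def jacobian(y, x):
    # One backward pass per output entry; each unit vector picks out one row.
    rows = []
    for i in range(y.numel()):
        unit = torch.zeros_like(y)
        unit[i] = 1
        row, = torch.autograd.grad(y, x, grad_outputs=unit, create_graph=True)
        rows.append(row)
    return torch.stack(rows, dim=0)

lin = torch.nn.Linear(3, 2, bias=False)
x = torch.randn(3, requires_grad=True)
J = jacobian(lin(x), x)
# For a linear map f(x) = Wx, the Jacobian equals the weight matrix W.
```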

@zou3519 also, where do you use the unit_vectors() function?

unit_vectors is used in the second-to-last line.

Why are multiple grad calls necessary? Shouldn’t one be sufficient?

Given a scalar output f_1 and an input vector x, autograd gives the gradient vector (\frac{\partial f_1}{\partial x_1}, \frac{\partial f_1}{\partial x_2}, ..., \frac{\partial f_1}{\partial x_n}).

For the full Jacobian matrix we have an output (f_1, f_2, ..., f_m) and we want every \frac{\partial f_i}{\partial x_j}. One way to get this is to run autograd once on each f_i, which is what is being done in the example above.

There might be a way to compress all of this computation into one backward pass; were you thinking of something specific?

Yes, I was hoping to penalize the norm of the entire Jacobian, computed in a single backwards pass. I can’t really rely on a solution that scales with the output dimension of the network. Is there no other way?

Why doesn’t autograd compute the full Jacobian ordinarily? Maybe I need to learn more about forward and backward auto-differentiation.

Reverse-mode autograd (what we have in PyTorch) is capable of computing vector-Jacobian products. That is, given a function f, an input x, and an arbitrary vector v, autograd can tell you v J, where J is the Jacobian of f with respect to x.
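A quick illustration with a linear map, chosen here because v J then has a closed form:

```python
import torch

W = torch.randn(3, 4)
x = torch.randn(4, requires_grad=True)
v = torch.randn(3)

y = W @ x  # f(x) = Wx, so the Jacobian J is exactly W
vJ, = torch.autograd.grad(y, x, grad_outputs=v)
# vJ should equal v @ W
```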

With multiple vector-Jacobian products (if we query autograd for u_1 J, u_2 J, u_3 J, etc., where the u_i are unit vectors) we can reconstruct the full Jacobian, which is what the code I wrote above does.

I now understand. I thought “vector-Jacobian products” meant “we can compute a Jacobian, and since the Jacobian is a matrix, we can multiply it with vectors.” But that isn’t the case; currently the two can’t be computed separately.

Oh, I am curious how forward-mode AD would enable your research project. Forward-mode AD computes Jacobian-vector products (as opposed to the vector-Jacobian products computed by reverse-mode AD).
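For what it's worth, a Jacobian-vector product can be emulated with reverse mode via the double-backward trick; this is a sketch, not an official API (later PyTorch releases also added torch.autograd.functional.jvp):

```python
import torch

def jvp(f, x, v):
    # Double-backward trick: take a vJP with a dummy cotangent u,
    # then differentiate that vJP w.r.t. u to recover J v.
    x = x.detach().requires_grad_()
    y = f(x)
    u = torch.zeros_like(y, requires_grad=True)
    vjp, = torch.autograd.grad(y, x, grad_outputs=u, create_graph=True)
    out, = torch.autograd.grad(vjp, u, grad_outputs=v)
    return out

W = torch.randn(3, 4)
v = torch.randn(4)
# For f(x) = Wx, the JVP is just W @ v.
result = jvp(lambda x: W @ x, torch.randn(4), v)
```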

I am not sure how JAX computes the full Jacobian, but I have always been curious about that. I might do some digging later.

To summarize our discussion above, it sounds like you just want to compute the full Jacobian and use it as a penalty. I posted some code above that computes a Jacobian by invoking autograd.grad multiple times; why doesn’t that work with your use case?

I have about 10 different architectures with output dimensions of order 10^3, so running 10 * 1000 backward passes per gradient step is probably far too slow. Unless maybe you think otherwise?

Yeah, that sounds pretty slow. I found a tracking issue for your particular problem (computing a Jacobian efficiently): https://github.com/pytorch/pytorch/issues/23475 so you can subscribe to that. Unfortunately I don’t have any suggestions for an easy way to speed up your case.
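For readers landing here later: newer PyTorch releases ship helpers in torch.autograd.functional that wrap the multiple-backward-pass approach discussed above (and a vectorize flag in more recent versions that can batch the passes); if your version has it, the full Jacobian is one call:

```python
import torch
from torch.autograd.functional import jacobian

W = torch.randn(5, 3)

def f(x):
    return W @ x

x = torch.randn(3)
J = jacobian(f, x)  # for a linear map this is exactly W
```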