How to Penalize Norm of End-to-End Jacobian

Ok thank you. Can I ask what the status on forward-mode automatic differentiation in Pytorch is?

I’m not sure what the status of it is, but as with all PyTorch features, if a lot of users want it the dev team takes that into consideration.

Do you have a specific use case for forward mode AD?


I’m not sure what constitutes a specific use case, but this is for a research project. What would you like to know about it?

Oh I am curious how forward-mode AD would enable your research project. Forward mode AD computes Jacobian vector products (as opposed to vector Jacobian products that are computed with reverse-mode AD).
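
For concreteness, the two products look like this with the torch.autograd.functional API (a sketch; availability depends on your PyTorch version, and note that functional.jvp is currently emulated with two reverse-mode passes rather than true forward-mode AD):

import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    return torch.tanh(x) ** 2

x = torch.randn(5)
v = torch.randn(5)

_, vT_J = vjp(f, x, v)  # reverse mode: vector-Jacobian product v^T J
_, J_v = jvp(f, x, v)   # forward-mode flavour: Jacobian-vector product J v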

Maybe forward-mode isn’t what I’m looking for. What mode does jax use to compute the full Jacobian?

I am not sure how jax computes the full Jacobian, but I have always been curious about that. I might do some digging later.

To summarize our discussion above, it sounds like you just want to compute the full Jacobian and use it as a penalty. I posted some code above that computes a Jacobian by invoking autograd.grad multiple times; why doesn’t that work with your use case?
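
For reference, the kind of loop I mean computes the Jacobian one output at a time, roughly like this (a sketch of the pattern, not the exact code posted earlier):

import torch

def jacobian(y, x, create_graph=False):
    # y: output of the network, x: input with requires_grad=True
    flat_y = y.reshape(-1)
    rows = []
    for i in range(flat_y.numel()):
        grad_out = torch.zeros_like(flat_y)
        grad_out[i] = 1.0  # pick out the i-th output
        grad_x, = torch.autograd.grad(flat_y, x, grad_out,
                                      retain_graph=True, create_graph=create_graph)
        rows.append(grad_x.reshape(-1))
    return torch.stack(rows)  # (out_dim, in_dim): one backward pass per output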


I have about 10 different architectures with output dimensions of order 10^3, so running 10 * 1000 backward passes per gradient step is probably far too slow. Unless maybe you think otherwise?

I am not sure how jax computes the full Jacobian, but I have always been curious about that. I might do some digging later.

If you do find out, please let me know :slight_smile:

Also, I appreciate you taking the time to help me.

Yeah, that sounds pretty bad. I found this tracking issue for your particular problem (computing a Jacobian efficiently): https://github.com/pytorch/pytorch/issues/23475 so you can subscribe to that. Unfortunately I don’t have any suggestions for easy ways to speed your case up :confused:

From my understanding, jax does as many backward passes as there are outputs, just like @richard proposed. But they have the vmap operator that allows them to do this more efficiently than a for loop in Python (even though the theoretical complexity is the same).
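
Conceptually, their jacrev is something like vmap applied to the vjp pullback. A sketch of the idea (not their actual implementation), assuming f maps a 1-D array to a 1-D array:

import jax
import jax.numpy as jnp

def my_jacrev(f):
    def jacfun(x):
        y, pullback = jax.vjp(f, x)             # one forward pass, reusable pullback
        basis = jnp.eye(y.size, dtype=y.dtype)  # one-hot cotangents, one per output
        rows, = jax.vmap(pullback)(basis)       # all "backward passes" in one batched call
        return rows                             # (out_dim, in_dim)
    return jacfun

jac = my_jacrev(lambda x: jnp.tanh(x) ** 2)(jnp.ones(3))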

Do you know how their vmap works? I’m curious if it’s similar to what NestedTensor will eventually be or if they just do some program transformations to accomplish it

From my understanding, they implement a batched version of every primitive function: https://jax.readthedocs.io/en/latest/notebooks/How_JAX_primitives_work.html#Batching

I guess this post is related

And you might like the following two repos

Unfortunately, it doesn’t support all types of networks (e.g. no batch norm), but convnets and all types of activations will work with just one backprop.

@albanD @richard how does the linked gist work? Previously, when I tried running backward() twice, the model wouldn’t learn, but switching to autograd.grad() fixed whatever problem existed. If I can use the linked gist and get backward() to work when run twice, then I might have a solution!

Hi,

The difference is that .backward() accumulates gradients into the .grad fields of the leaves; .grad() does not.
So you were most likely accumulating extra gradients when calling .backward() multiple times, if you were not calling .zero_grad() before the last call.
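
A tiny illustration of the difference (just a sketch):

import torch

w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()

g, = torch.autograd.grad(loss, w, retain_graph=True)
print(w.grad)    # None: autograd.grad returns the gradient, nothing is accumulated

loss.backward()  # this one accumulates into w.grad
print(w.grad)    # now equal to 2 * w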

Hmm… I’m not sure this accords with what I see empirically. I’m using grad() followed by loss.backward() and this seems to change the training of the model compared with just running loss.backward(). If grad doesn’t accumulate gradients, then why does the outcome differ?

Maybe I don’t know exactly what you mean by leaves.

Sorry, maybe that wasn’t clear. The two different cases are:

opt.zero_grad()
loss = xxx(inputs) # Compute your loss
grads = xxx(loss) # Compute gradients wrt your loss
penalty = xxx(grads) # Compute the gradient penalty
final_loss = loss + penalty
final_loss.backward()
opt.step()

In the example above, you want to take your gradient step using only the gradients computed during final_loss.backward(). But if grads is computed with .backward(create_graph=True), you accumulate some extra gradients into the .grad fields; you don’t if you compute grads with autograd.grad(create_graph=True).
So the gradients you step with are different in the two cases. That could explain your model training properly in one case but not the other.
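
A concrete version of the pattern above might look like this (a sketch: the model, loss, and input-gradient penalty are made up just to make the pattern runnable):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

inputs = torch.randn(32, 10, requires_grad=True)
targets = torch.randn(32, 1)

opt.zero_grad()
loss = ((model(inputs) - targets) ** 2).mean()   # compute your loss

grads, = torch.autograd.grad(loss, inputs,       # gradients wrt the inputs,
                             create_graph=True)  # kept in the graph for the double backward
penalty = grads.pow(2).sum(dim=1).mean()         # gradient-norm penalty

final_loss = loss + penalty
final_loss.backward()                            # the only call that fills the .grad fields
opt.step()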

@albanD @richard, I have a question regarding our conversation above. Referring to richard’s code near the top, I’m trying to rewrite his loop using a larger batch size, but I’m getting “One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.” Can you help me understand why I’m seeing this error?

Here’s what I’m doing. Let x be the input to the graph with shape (batch size, input dimension) and let y be the output of the graph with shape (batch size, output dimension). I then select a subset of N random unit vectors. I stack x with itself and y with itself as follows:

x = torch.cat([x for _ in range(N)], dim=0)

and

y = torch.cat([y for _ in range(N)], dim=0)

x then has shape (N * batch size, input dim) and y has shape (N * batch size, output dim). But then, when I try to use autograd, I receive the aforementioned error.

jacobian = torch.autograd.grad(
    outputs=y,
    inputs=x,
    grad_outputs=subset_unit_vectors,
    retain_graph=True,
    only_inputs=True)[0]

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

Why is this, and is there a way to make this large batch approach work instead of looping?

It appears that repeating the tensor and concatenating the list destroys the path in the computational graph from x to y.
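
One way to make the batched approach work is to repeat x before the forward pass, so the repeated y is actually computed from the repeated x. A sketch with a made-up model and N random unit vectors as grad_outputs:

import torch

batch_size, in_dim, out_dim, N = 32, 10, 5, 3
model = torch.nn.Linear(in_dim, out_dim)

x = torch.randn(batch_size, in_dim)
x_rep = x.repeat(N, 1).requires_grad_(True)  # (N * batch, in_dim), built BEFORE the forward
y_rep = model(x_rep)                         # (N * batch, out_dim), so the path x_rep -> y_rep exists

v = torch.randn(N, out_dim)
v = v / v.norm(dim=1, keepdim=True)          # N random unit vectors
subset_unit_vectors = v.repeat_interleave(batch_size, dim=0)  # (N * batch, out_dim)

vjp_rows = torch.autograd.grad(
    outputs=y_rep,
    inputs=x_rep,
    grad_outputs=subset_unit_vectors,
    create_graph=True)[0]                    # (N * batch, in_dim): v_i^T J for every sample

penalty = vjp_rows.pow(2).sum(dim=1).mean()  # penalize the projected Jacobian norm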