Computing output gradients for NTK

Hello everyone, sorry in advance for the beginner question.

Let’s say I have a small model net with d parameters and output dimension c (like a classifier for MNIST). Let’s also say that I feed this model a minibatch of images (batch size B). At this point I would have something like

outputs = net(images)

where the shape of outputs will be (B, c). I would like to take the gradients with respect to the parameters of the network, something like this:

outputs.backward()
feature_maps = [param.grad for param in net.parameters()]

and end up with feature_maps of shape (B, c, d). This is not directly doable, I guess; in particular, PyTorch complains that I need a scalar tensor to call .backward() on.

The problem is solvable: I can just feed one image at a time to the model, compute the gradient for each output separately, and then put everything back together. Is there a smarter/faster way to achieve this?
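For concreteness, this is more or less the loop I have in mind (just a rough sketch, the helper name is made up):

import torch

def per_sample_per_output_grads(net, images):
    # Naive loop: one backward pass per sample and per output dimension.
    params = [p for p in net.parameters() if p.requires_grad]
    feature_maps = []
    for x in images:                                 # B iterations
        out = net(x.unsqueeze(0)).squeeze(0)         # shape (c,)
        rows = []
        for k in range(out.shape[0]):                # c iterations
            grads = torch.autograd.grad(out[k], params, retain_graph=True)
            rows.append(torch.cat([g.reshape(-1) for g in grads]))  # shape (d,)
        feature_maps.append(torch.stack(rows))       # shape (c, d)
    return torch.stack(feature_maps)                 # shape (B, c, d)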

I’m basically computing these feature maps in order to compute the neural tangent kernel later (even if I’m doing it for each output separately for this classifier), so I believe there should be a more standard way of doing this.
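For reference, this is roughly what I plan to do with feature_maps afterwards (just a sketch with made-up sizes): the empirical NTK is the Gram matrix of the per-output gradients, contracted over the parameter dimension d.

import torch

B, c, d = 8, 10, 100                       # made-up sizes
feature_maps = torch.randn(B, c, d)        # placeholder for the real Jacobians
# NTK entry [(i, a), (j, b)] = sum over parameters of J[i, a, :] . J[j, b, :]
ntk = torch.einsum('iad,jbd->iajb', feature_maps, feature_maps)   # shape (B, c, B, c)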

You could either reduce the outputs to a scalar value and call .backward() on it, or alternatively pass the gradients to the backward operation.
@albanD explains the reasoning behind this in this post.
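Something like this (toy example, the sizes and names are just illustrative):

import torch

net = torch.nn.Linear(784, 10)
images = torch.randn(8, 784)
outputs = net(images)                          # shape (B, c)

# Option 1: reduce the outputs to a scalar and call .backward() on it
outputs.sum().backward(retain_graph=True)

# Option 2: pass a gradient tensor with the same shape as outputs
net.zero_grad()
outputs.backward(torch.ones_like(outputs))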


Thank you for your answer!

Then, it appears that .backward() always “returns” something with the same shape as the parameters. To be more specific, it saves in the .grad attribute of each parameter only one number per parameter entry: either the derivative of a scalar loss with respect to that entry, or the scalar product between a fixed vector vec (given as input to .backward()) and the vector of derivatives of a vector-valued loss with respect to that entry.
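If I understood correctly, a toy check of this would look like the following (sizes are made up; for a bias-free linear layer the vector-Jacobian product is just the outer product of vec and the input):

import torch

net = torch.nn.Linear(3, 2, bias=False)
x = torch.randn(1, 3)
out = net(x).squeeze(0)                        # vector output, shape (2,)
vec = torch.tensor([0.5, -1.0])

out.backward(vec)
# net.weight.grad has the same shape as net.weight: one number per parameter
# entry, namely vec . (d out / d that entry), not a full Jacobian block.
expected = vec.unsqueeze(1) * x                # outer product of vec and x, shape (2, 3)
print(torch.allclose(net.weight.grad, expected))   # True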

In particular, is there no way to parallelize the computation of the Jacobian? I guess there isn’t…

Furthermore, let’s say the vec I pass to .backward() is just a vector of ones, since I’m interested in the sum of the gradients. Is it better to approach the problem this way, or to sum up the outputs of my vector-valued loss first? (The two should be equivalent by linearity, but maybe one method is computationally/numerically better.)

There is an experimental way to do it :wink: see the vectorize flag on jacobian here, or use the vmap prototype directly in your code (doc here).
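Roughly like this with jacobian (a sketch for a single linear layer; since jacobian differentiates with respect to the inputs of the function you pass it, the forward pass has to be rewritten as an explicit function of the parameters):

import torch
from torch.autograd.functional import jacobian

net = torch.nn.Linear(784, 10)
images = torch.randn(8, 784)

def f(weight, bias):
    # the forward pass expressed as a function of the parameters
    return torch.nn.functional.linear(images, weight, bias)   # shape (B, c)

# One Jacobian per parameter, each of shape (B, c, *param.shape); vectorize=True
# batches the backward passes with vmap instead of looping over the B*c outputs.
jac = jacobian(f, (net.weight, net.bias), vectorize=True)
feature_maps = torch.cat([j.reshape(8, 10, -1) for j in jac], dim=-1)   # shape (B, c, d)
print(feature_maps.shape)                                    # torch.Size([8, 10, 7850])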

Furthermore, let’s say the vec I pass to .backward() is just a vector of ones, since I’m interested in the sum of the gradients. Is it better to approach the problem this way, or to sum up the outputs of my vector-valued loss first? (The two should be equivalent by linearity, but maybe one method is computationally/numerically better.)

The two are the same really.
Giving the vector of ones will be marginally faster because you create one less Node in the graph, but that won’t make any measurable difference.
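A tiny numerical check, if you want to convince yourself (toy sizes):

import torch

net = torch.nn.Linear(4, 3)
x = torch.randn(2, 4)

net(x).sum().backward()
g1 = [p.grad.clone() for p in net.parameters()]

net.zero_grad()
net(x).backward(torch.ones(2, 3))
g2 = [p.grad for p in net.parameters()]

print(all(torch.allclose(a, b) for a, b in zip(g1, g2)))   # True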
