Autograd Jacobian vs. Finite Difference Approximation

I just took a look at the implementation of torch.autograd.functional.jacobian, and it appears to do the following (simplified to a single input and a single output tensor):

input.requires_grad = True
output = func(input)
jac_rows = []
for o in output:  # one backward pass per output element
    (row,) = torch.autograd.grad(o, input, retain_graph=True)
    jac_rows.append(row)
jac = torch.stack(jac_rows)

So if we have 1-D input and output tensors of shape (N,) and (M,) respectively, then this performs M backward passes through the graph. To estimate the Jacobian one could instead use Finite Difference Approximation, which requires N forward passes. One can also assume that a forward pass and a backward pass cost roughly the same.
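For concreteness, a minimal forward-difference sketch (the function name fd_jacobian and the choice of a one-sided difference are my own; this is not how jacobian works internally) that builds the (M, N) Jacobian estimate from N extra forward passes:

```python
import torch

def fd_jacobian(func, x, eps=1e-6):
    # Forward-difference Jacobian estimate: one extra forward pass
    # per input element, i.e. N passes for x of shape (N,).
    y0 = func(x)
    jac = torch.empty(y0.numel(), x.numel(), dtype=x.dtype)
    for i in range(x.numel()):
        x_pert = x.clone()
        x_pert[i] += eps
        jac[:, i] = (func(x_pert) - y0) / eps
    return jac

# Example: a linear map from (3,) to (4,), so 3 forward passes suffice;
# for a linear func the estimate recovers A up to rounding error.
A = torch.randn(4, 3, dtype=torch.float64)
func = lambda x: A @ x
x = torch.randn(3, dtype=torch.float64)
```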

Now the thing is, I’m using this jacobian function specifically because I want to use the Levenberg-Marquardt algorithm to fit a small network (~100 parameters), which is more efficient than other gradient-based optimizers for such problems. The Levenberg-Marquardt algorithm requires M >= N, so this backward-mode Jacobian is less efficient than forward difference approximation (M vs. N passes). I’m wondering why this implementation was chosen for jacobian and whether there’s a more efficient way. After all, there must be a reason to use jacobian rather than just one of the available optimizers. If the reason is to implement a custom optimizer based on the Jacobian, then most likely M > N and the implementation is inefficient (e.g. in one of my cases N = 84 and M = 576, so the Autograd Jacobian is about 7 times slower than Finite Difference Approximation).
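To make the M >= N requirement concrete, here is a hypothetical sketch of a single damped Levenberg-Marquardt update (the name lm_step and the plain damping term lam * I are my own simplifications): the (M, N) Jacobian enters only through the N x N normal equations, which need J to have full column rank, hence M >= N.

```python
import torch

def lm_step(residuals, jac, lam=1e-3):
    # residuals: shape (M,); jac: shape (M, N) with M >= N.
    # Solves (J^T J + lam * I) delta = J^T r for the update delta;
    # the new parameters would be p - delta.
    JtJ = jac.T @ jac
    Jtr = jac.T @ residuals
    n = JtJ.shape[0]
    return torch.linalg.solve(JtJ + lam * torch.eye(n, dtype=JtJ.dtype), Jtr)
```

With lam = 0 this reduces to a Gauss-Newton step, so for a linear residual r = J @ e it recovers e exactly.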

Also, it seems strange to compute the backward pass separately for each element of the output tensor. This is like using an explicit for loop over the batch dimension in the forward pass (which we of course don’t do). The forward pass exploits all available parallelism across the batch dimension, so I’m wondering whether something similar can be done for the backward pass and the Jacobian, i.e. performing the computations for all outputs at once.
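For what it's worth, recent PyTorch versions do offer two relevant options (availability depends on your version; vectorize is documented as experimental): jacobian(..., vectorize=True) batches the backward passes via vmap instead of a Python loop, and torch.func.jacfwd uses forward-mode AD, whose cost scales with the N inputs rather than the M outputs, which suits the M > N case:

```python
import torch
from torch.autograd.functional import jacobian

A = torch.randn(6, 4)
func = lambda x: torch.tanh(A @ x)
x = torch.randn(4)

# Reverse mode, but with the per-output loop replaced by a single
# batched (vmap-based) traversal of the graph.
J_vec = jacobian(func, x, vectorize=True)

# Forward mode: roughly one pass per input element (N of them),
# rather than one per output element (M of them).
J_fwd = torch.func.jacfwd(func)(x)
```

Both compute the exact Jacobian, so they agree up to floating-point error.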