Autograd Jacobian vs. Finite Difference Approximation

I just took a look at the implementation of torch.autograd.functional.jacobian, and it appears to do the following (simplified to a single input and a single output tensor):

input.requires_grad = True
output = func(input)
jac_rows = []
for o in output:  # one backward pass per output element
    (row,) = torch.autograd.grad(o, input, retain_graph=True)
    jac_rows.append(row)
jac = torch.stack(jac_rows)

So if we have 1-D input and output tensors of shape (N,) and (M,) respectively, then this performs M backward passes through the graph. To estimate the Jacobian one could instead use Finite Difference Approximation, which requires N forward passes. One can also assume that a forward pass and a backward pass cost roughly the same.
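For concreteness, a minimal forward-difference sketch (the function name fd_jacobian and the choice of a one-sided difference are my own; this is not how jacobian works internally) that builds the (M, N) Jacobian estimate from N extra forward passes:

```python
import torch

def fd_jacobian(func, x, eps=1e-6):
    # Forward-difference Jacobian estimate: one extra forward pass
    # per input element, i.e. N passes for x of shape (N,).
    y0 = func(x)
    jac = torch.empty(y0.numel(), x.numel(), dtype=x.dtype)
    for i in range(x.numel()):
        x_pert = x.clone()
        x_pert[i] += eps
        jac[:, i] = (func(x_pert) - y0) / eps
    return jac

# Example: a linear map from (3,) to (4,), so 3 forward passes suffice;
# for a linear func the estimate recovers A up to rounding error.
A = torch.randn(4, 3, dtype=torch.float64)
func = lambda x: A @ x
x = torch.randn(3, dtype=torch.float64)
```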

Now the thing is, I’m using this jacobian function specifically because I want to use the Levenberg-Marquardt algorithm to fit a small network (~100 parameters), which is more efficient than other gradient-based optimizers for such problems. The Levenberg-Marquardt algorithm requires M >= N, so this backward-mode Jacobian is less efficient than forward difference approximation (M vs. N passes). I’m wondering why this implementation was chosen for jacobian and whether there’s a more efficient way. After all, there must be a reason to use jacobian rather than just one of the available optimizers. If the reason is to implement a custom optimizer based on the Jacobian, then most likely M > N and the implementation is inefficient (e.g. in one of my cases N = 84 and M = 576, so the Autograd Jacobian is about 7 times slower than Finite Difference Approximation).
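To make the M >= N requirement concrete, here is a hypothetical sketch of a single damped Levenberg-Marquardt update (the name lm_step and the plain damping term lam * I are my own simplifications): the (M, N) Jacobian enters only through the N x N normal equations, which need J to have full column rank, hence M >= N.

```python
import torch

def lm_step(residuals, jac, lam=1e-3):
    # residuals: shape (M,); jac: shape (M, N) with M >= N.
    # Solves (J^T J + lam * I) delta = J^T r for the update delta;
    # the new parameters would be p - delta.
    JtJ = jac.T @ jac
    Jtr = jac.T @ residuals
    n = JtJ.shape[0]
    return torch.linalg.solve(JtJ + lam * torch.eye(n, dtype=JtJ.dtype), Jtr)
```

With lam = 0 this reduces to a Gauss-Newton step, so for a linear residual r = J @ e it recovers e exactly.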

Also, it seems strange to compute the backward pass separately for each element of the output tensor. This is like using an explicit for loop over the batch dimension in the forward pass (which we of course don’t do). The forward pass exploits all available parallelism across the batch dimension, so I’m wondering whether something similar can be done for the backward pass and the Jacobian, i.e. performing the computations for all outputs at once.
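For what it's worth, recent PyTorch versions do offer two relevant options (availability depends on your version; vectorize is documented as experimental): jacobian(..., vectorize=True) batches the backward passes via vmap instead of a Python loop, and torch.func.jacfwd uses forward-mode AD, whose cost scales with the N inputs rather than the M outputs, which suits the M > N case:

```python
import torch
from torch.autograd.functional import jacobian

A = torch.randn(6, 4)
func = lambda x: torch.tanh(A @ x)
x = torch.randn(4)

# Reverse mode, but with the per-output loop replaced by a single
# batched (vmap-based) traversal of the graph.
J_vec = jacobian(func, x, vectorize=True)

# Forward mode: roughly one pass per input element (N of them),
# rather than one per output element (M of them).
J_fwd = torch.func.jacfwd(func)(x)
```

Both compute the exact Jacobian, so they agree up to floating-point error.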