Hi everyone, am I correct that there is currently no way to efficiently compute the Hessian-vector product using the Pearlmutter trick (from "Fast Exact Multiplication by the Hessian"), i.e. in approximately the same time complexity as two gradient computations, due to the missing forward-mode AD in PyTorch? Is there any way to work around this?
I am aware that I can simply use plain AD to compute the Hessian-vector product (which will be slow), and that I can use a finite difference approximation (which will be inexact and numerically unstable). I’ve also seen the R-operator and L-operator defined in a similar issue ([Adding functionality] Hessian and Fisher Information vector products) and how they replace the R-operator with two L-operator calls; however, this still won’t be O(N), or am I missing something? Thanks for your input!
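For context, the finite difference approximation mentioned here is typically the central difference Hv ≈ (∇L(θ + εv) − ∇L(θ − εv)) / (2ε). A minimal sketch on a toy quadratic (the function and names below are illustrative, not from this thread):

```python
import torch

def finite_diff_hvp(f, theta, v, eps=1e-4):
    # Central difference: H v ≈ (∇f(θ + εv) − ∇f(θ − εv)) / (2ε)
    def grad_at(point):
        point = point.detach().clone().requires_grad_(True)
        f(point).backward()
        return point.grad
    return (grad_at(theta + eps * v) - grad_at(theta - eps * v)) / (2 * eps)

# Toy check on f(θ) = ½ θᵀAθ, whose Hessian is exactly A
A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
f = lambda t: 0.5 * t @ A @ t
theta = torch.tensor([1.0, -1.0])
v = torch.tensor([0.5, 2.0])
hv = finite_diff_hvp(f, theta, v)  # close to A @ v
```

For a quadratic this is exact up to floating-point error; for a general loss it is only an O(ε²) approximation, and the choice of ε trades truncation error against round-off, which is the numerical instability mentioned above.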
The R and L operators are O(N) by definition, since they are just a backward pass, or backward called twice. So yes, the actual cost is more like 3N, but you drop the constant factor, so it is still O(N).
I guess you’re right theoretically, but when I try it in practice I get vastly different runtimes. Maybe an implementation issue? Maybe I misunderstood something about the R-op or L-op? This is what the comparison looks like (timing on CPU):
Pearlmutter: 3.89 s ± 151 ms per loop (7 runs, 1 loop each)
Simple AD: 1.21 s ± 64.5 ms per loop (7 runs, 1 loop each)
Finite difference approximation: 327 ms ± 11 ms per loop (7 runs, 1 loop each)
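For anyone reproducing a comparison like this outside IPython’s %timeit, a minimal harness with the standard timeit module (the quadratic loss below is a hypothetical stand-in, not the model behind the numbers above):

```python
import timeit
import torch

# Hypothetical stand-in for the benchmarked model: a quadratic loss
# whose Hessian is the fixed matrix A
A = torch.randn(200, 200)
theta = torch.randn(200, requires_grad=True)
v = torch.randn(200)
loss = lambda t: 0.5 * t @ A @ t

def hvp_once():
    # Plain double-backward HVP, used here only as the timed workload
    g = torch.autograd.grad(loss(theta), theta, create_graph=True)[0]
    return torch.autograd.grad(g @ v, theta)[0]

elapsed = timeit.timeit(hvp_once, number=10)
print(f"double backward: {elapsed / 10 * 1e3:.2f} ms per call")
```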
The first two return the same (correct) result and the last one returns a pretty good approximation. Here is more or less what the Pearlmutter implementation looks like:
import torch
from torch.nn.utils import parameters_to_vector

def hvp(model, criterion, inputs, outputs, v, *args, **kwargs):
    # Forward pass
    prediction = model(inputs)
    L = criterion(prediction, outputs)
    # Compute the gradient, keeping the graph for the second differentiation
    y = torch.autograd.grad(L, model.parameters(), create_graph = True, retain_graph = True)
    # Apply R-operator on the gradient
    vp = rop(parameters_to_vector(y), model.parameters(), v)
    return vp

def rop(y, x, v):
    # Adapted from: https://gist.github.com/apaszke/c7257ac04cb8debb82221764f6d117ad
    w = torch.zeros(y.size(), requires_grad = True)
    g = torch.autograd.grad(y, x, grad_outputs = w, create_graph = True, allow_unused = True)
    r = torch.autograd.grad(parameters_to_vector(g), w, grad_outputs = v, create_graph = False, allow_unused = True)
    return r
Quick note about your rop function: if your grad_outputs is a Tensor of 0s, all your gradients in g will be 0. Maybe you meant ones?
Also, hvp (if it means Hessian-vector product) does not need 3 calls to backward as you do here; only 2 are needed.
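The 2-backward version referred to here differentiates the scalar ∇L · v with respect to the parameters; a minimal sketch on a toy quadratic (illustrative names, single parameter tensor for simplicity):

```python
import torch

def hvp_double_backward(loss_fn, theta, v):
    # Differentiate (∇L · v) w.r.t. θ: two backward passes, no R-op needed
    g = torch.autograd.grad(loss_fn(theta), theta, create_graph=True)[0]
    return torch.autograd.grad(g @ v, theta)[0]

# Toy quadratic whose Hessian is exactly A
A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
loss = lambda t: 0.5 * t @ A @ t
theta = torch.tensor([1.0, -1.0], requires_grad=True)
v = torch.tensor([0.5, 2.0])
hv = hvp_double_backward(loss, theta, v)  # equals A @ v up to float error
```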
I am not familiar with the Pearlmutter trick, but does it boil down to mixing forward- and backward-mode AD to compute the Hessian-vector product faster?
If so, in the approximation that the forward pass, the backward pass, and a forward-mode pass all have the same cost C:
Computing the Hessian vector product with double backward will cost 4C (C forward, C first backward, 2C second backward).
Computing the Hessian vector product with forward/backward AD will cost 3C (C forward, C forward mode grads, C backward of forward mode grad).
So while this would give some benefit compared to double backward, it might not be as big as what you expect.
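For reference, the forward-over-reverse scheme described above can now be written directly with the torch.func API, which was added to PyTorch well after this thread (the quadratic below is illustrative):

```python
import torch
from torch.func import grad, jvp

# Toy quadratic whose Hessian is exactly A
A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
loss = lambda t: 0.5 * t @ A @ t
theta = torch.tensor([1.0, -1.0])
v = torch.tensor([0.5, 2.0])

# Forward-mode AD through the reverse-mode gradient function:
# one forward pass, one backward pass, one forward-mode pass
_, hv = jvp(grad(loss), (theta,), (v,))  # H v, i.e. A @ v
```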
Thanks for your thoughts @albanD ! So the implementation above is not the vanilla 2-backward calls implementation of HVPs, it’s supposed to implement the Pearlmutter trick, which is indeed mixing forward and backward AD if I understand correctly. We can replace the forward-mode with two reverse-mode calls (see for reference j-town’s A new trick for calculating Jacobian vector products) so altogether three backward calls as above should be fine. Please correct me if I’m wrong.
Now what I don’t understand is why this approach is taking so much longer than both the naive (“Simple AD” above) approach and the finite difference approximation. It’s one order of magnitude slower than the finite difference approximation, which requires two gradient evaluations and, asymptotically speaking, should have the same complexity as Pearlmutter’s method. I guess I’m lacking some knowledge of autograd’s internals here…
EDIT: So rereading the above linked blog post’s conclusion actually brings some clarification:
For Autograd, the situation isn’t quite as simple. Autograd doesn’t do its differentiation ahead of time, and needs to trace a function’s execution in order to reverse-differentiate it. This makes our new trick, in its basic form, significantly less efficient than the forward mode implementation that I’ve written.
So this is also the case for PyTorch’s autograd, I guess? As a result the trick doesn’t bring any advantage because it computes more than is “necessary” in some way? Is an actual forward-mode implementation the only way to move forward?
A forward-mode implementation is going to give you benefits in theory, yes. But the theory only tells you that it will be 3/4 of the runtime: it would go from 4 s to 3 s. That is significant, but not that much given the measurements you showed above.
Not sure what the practical result would be, though, as you would need a forward-mode AD system.