Hello,

First of all, I’m not really comfortable with auto-diff, and I’ve had a hard time understanding the difference between reverse mode AD and forward mode AD. The notable difference that I seem to have understood is that one will be run alongside the forward pass, in order to minimize the numbers of operations used to compute a JVP.
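To make sure I have the forward-mode picture right: as I understand it, forward mode carries a tangent through the computation in lockstep with the primal values, dual-number style, in a single pass. A toy sketch of that idea in plain Python (no autograd, names are mine):

```python
# Forward-mode AD on f(x) = x**2 + 3*x using dual numbers (value, tangent).
# The tangent is propagated alongside the value in a single forward pass.
class Dual:
    def __init__(self, val, tan):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        # Product rule: (f * g)' = f' * g + f * g'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __radd__ = __add__
    __rmul__ = __mul__

x = Dual(2.0, 1.0)    # seed tangent 1.0 -> derivative w.r.t. x
y = x * x + 3 * x
print(y.val, y.tan)   # 10.0, f'(2) = 2*2 + 3 = 7.0
```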

If this understanding is correct, I’d expect the forward mode AD JVP to be faster than the double grad trick, as it will run one loop instead of two.
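Concretely, by "double grad trick" I mean composing two reverse-mode passes: differentiate the VJP with respect to its seed vector. A self-contained sketch, which for a small elementwise function agrees with `torch.autograd.functional.jvp` (the function `f` here is just an example):

```python
import torch
from torch.autograd.functional import jvp as functional_jvp

def double_grad_jvp(f, x, u):
    # Two reverse passes: first a VJP seeded with a dummy v,
    # then differentiate that VJP w.r.t. v, seeded with u.
    x = x.detach().requires_grad_(True)
    y = f(x)
    v = torch.zeros_like(y, requires_grad=True)
    (vjp,) = torch.autograd.grad(y, x, grad_outputs=v, create_graph=True)
    (result,) = torch.autograd.grad(vjp, v, grad_outputs=u)
    return result

f = lambda t: torch.sin(t) * t
x, u = torch.randn(5), torch.randn(5)
_, ref = functional_jvp(f, (x,), (u,))
print(torch.allclose(double_grad_jvp(f, x, u), ref))  # True
```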

When I benchmark both methods, the double trick still seems to be running faster than forward mode AD.

Here’s what I ran to evaluate both methods (I read in a GitHub issue that the `torch.autograd.functional` API uses forward mode AD):

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jvp

def Ju(x, y, u):
    # Double grad trick: compute the JVP with two reverse-mode passes,
    # by differentiating the VJP with respect to its seed vector v
    v = torch.zeros_like(y, requires_grad=True)
    (vjp,) = torch.autograd.grad(y, x, grad_outputs=v, create_graph=True)
    (result,) = torch.autograd.grad(vjp, v, grad_outputs=u)
    return result

Ju_fast = lambda u: torch.matmul(u, model.weight.T)  # Closed-form JVP: the Jacobian of a linear model is W

model = nn.Linear(20, 20)
input_ = torch.randn(16, 20)

x = input_.clone().requires_grad_(True)
y = model(x)
u = torch.randn(16, 20)
```
```python
%%timeit -n 10 -r 1000
x = input_.clone().requires_grad_(True)
y = model(x)
Ju(x, y, u)
# Yields 131 µs ± 48.3 µs per loop (mean ± std. dev. of 1000 runs, 10 loops each)
```
```python
%%timeit -n 10 -r 1000
jvp(model, (input_,), (u,))
# Yields 192 µs ± 94 µs per loop (mean ± std. dev. of 1000 runs, 10 loops each)
```
```python
%%timeit -n 10 -r 1000
Ju_fast(u)
# Yields 14.6 µs ± 6.26 µs per loop (mean ± std. dev. of 1000 runs, 10 loops each)
```

From what I understood, forward mode AD should compute `J @ u` alongside `W @ x + b` for the linear model (for an elementwise function, `J` is diagonal, so the JVP reduces to an elementwise product), which should make the JVP computation very fast.
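To illustrate the elementwise case: for `tanh` applied elementwise, the Jacobian is `diag(1 - tanh(x)**2)`, so the JVP is just that derivative multiplied elementwise by `u`. A small sanity check of this claim:

```python
import torch
from torch.autograd.functional import jvp

x = torch.randn(8)
u = torch.randn(8)

# For elementwise tanh, J = diag(1 - tanh(x)**2), so J @ u is an elementwise product
_, jvp_val = jvp(torch.tanh, (x,), (u,))
expected = (1 - torch.tanh(x) ** 2) * u
print(torch.allclose(jvp_val, expected))  # True
```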

Am I mistaken in my understanding of what’s happening inside PyTorch? Are these results to be expected, and why?

`jvp` from `torch.autograd.functional` does not seem to use forward AD. If I understand the doc you referred to correctly, forward AD relies on `torch.autograd.forward_ad`.

```python
import torch.autograd.forward_ad as fwAD

params = {name: p for name, p in model.named_parameters()}
tangents = {name: torch.rand_like(p) for name, p in params.items()}

with fwAD.dual_level():
    for name, p in params.items():
        delattr(model, name)
        setattr(model, name, fwAD.make_dual(p, tangents[name]))
    out = model(input_)
    jvp = fwAD.unpack_dual(out).tangent
```

Disclaimer: this does not return the same values as your other methods (I haven’t investigated further), but it is significantly faster (3 to 4 times for me, using `torch` version 1.12.1). Maybe this could still help?
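One likely reason the values differ: the snippet above attaches tangents to the *parameters*, so it computes a JVP with respect to the weights, while the other methods compute it with respect to the input. A version that seeds the input instead (my guess at what would make the numbers comparable):

```python
import torch
import torch.nn as nn
import torch.autograd.forward_ad as fwAD

model = nn.Linear(20, 20)
input_ = torch.randn(16, 20)
u = torch.randn(16, 20)

with fwAD.dual_level():
    dual_in = fwAD.make_dual(input_, u)   # seed the *input* with tangent u
    dual_out = model(dual_in)
    y, jvp_val = fwAD.unpack_dual(dual_out)

# For a linear layer, the JVP w.r.t. the input is simply u @ W.T
print(torch.allclose(jvp_val, u @ model.weight.T))  # True
```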


From your code and the documentation of `fwAD.make_dual`, I got a function that computes the JVP, and it does seem to run faster than the double back trick. I also increased the input size, since the double back trick might be faster only for very small inputs.

The torch version I’m using for these experiments is 1.12. Here’s the new code:

```python
import torch
import torch.nn as nn
from torch.autograd.forward_ad import dual_level, make_dual, unpack_dual

def jvp(model, inputs, tangents):
    """
    Given a function `model` whose Jacobian is `J`, compute the Jacobian-vector product (`jvp`)
    between `J` and a given vector `u` as follows.

    Example::

        >>> with dual_level():
        ...     inp = make_dual(x, u)
        ...     out = model(inp)
        ...     y, jvp = unpack_dual(out)
    """
    with dual_level():
        inp = make_dual(inputs[0], tangents[0])
        out = model(inp)
        y, jvp = unpack_dual(out)
    return jvp

def Ju(x, y, u):
    # Double grad trick: compute the JVP with two reverse-mode passes
    v = torch.zeros_like(y, requires_grad=True)
    (vjp,) = torch.autograd.grad(y, x, grad_outputs=v, create_graph=True)
    (result,) = torch.autograd.grad(vjp, v, grad_outputs=u)
    return result

Ju_fast = lambda u: torch.matmul(u, model.weight.T)  # Closed-form JVP: the Jacobian of a linear model is W

model = nn.Linear(100, 100)
input_ = torch.randn(32, 100)

x = input_.clone().requires_grad_(True)
y = model(x)
u = torch.randn(32, 100)

# Make sure that all the gradients are equal
assert torch.allclose(Ju(x, y, u), Ju_fast(u), atol=1e-5)
assert torch.allclose(jvp(model, (input_,), (u,)), Ju_fast(u), atol=1e-5)
```
```python
%%timeit -n 100 -r 1000
x = input_.clone().requires_grad_(True)
y = model(x)
Ju(x, y, u)
# => 282 µs ± 436 µs per loop (mean ± std. dev. of 1000 runs, 100 loops each)
```
```python
%%timeit -n 100 -r 1000
jvp(model, (input_,), (u,))
# => 344 µs ± 542 µs per loop (mean ± std. dev. of 1000 runs, 100 loops each)
```