I want to compute Jacobian matrices using pytorch’s autograd
. Autograd natively computes Jacobian-vector products, so I’d simple like to pass an identity matrix to obtain the full Jacobian (ie, Jv = JI = J).
One wrinkle: I’d like to implement both standard reverse-mode AD computation for the Jacobian, but also a forward-mode version (which should be faster for most of my applications) using the following trick due to Jamie Townshend:
I’ve actually gotten the former (reverse-mode) working, using a few hits found online and some fairly straightforward work:
def rev_jacobian(fxn, x, n_outputs, retain_graph):
"""
the basic idea is to create N copies of the input
and then ask for each of the N dimensions of the
output... this allows us to compute J with pytorch's
jacobian-vector engine
"""
# expand the input, one copy per output dimension
n_outputs = int(n_outputs)
repear_arg = (n_outputs,) + (1,) * len(x.size())
xr = x.repeat(*repear_arg)
xr.requires_grad_(True)
# both y and I are shape (n_outputs, n_outputs)
# checking y shape lets us report something meaningful
y = fxn(xr).view(n_outputs, -1)
if y.size(1) != n_outputs:
raise ValueError('Function `fxn` does not give output '
'compatible with `n_outputs`=%d, size '
'of fxn(x) : %s'
'' % (n_outputs, y.size(1)))
I = torch.eye(n_outputs, device=xr.device)
J = autograd.grad(y, xr,
grad_outputs=I,
retain_graph=retain_graph,
create_graph=True, # for higher order derivatives
)
return J[0]
However, for the forward-mode version I can’t seem to get autograd
to behave. Somehow, I cannot seem to get it to accept the identity matrix as an argument for grad_outputs
and NOT return a sum of the columns of the final Jacobian I’d like. But given the success of the reverse-mode implementation, it seems like it should be easy!!
Here’s the current code, with a commented out version of what I’d like to do:
def fwd_jacobian(fxn, x, n_outputs, retain_graph):
"""
This implementation is very similar to the above, but with
one twist. To implement a forward-mode AD with rev-mode
calls, we first compute the rev-mode VJP for one vector (v)
then we call d/dv(VJP) `n_outputs` times, one per basis vector,
to obtain the Jacobian.
This should be faster if `n_outputs` > "n_inputs"
References
----------
.[1] https://j-towns.github.io/2017/06/12/A-new-trick.html
(Thanks to Jamie Townsend for this awesome trick!)
"""
xd = x.detach().requires_grad_(True)
n_inputs = int(xd.size(0))
# first, compute *any* VJP
v = torch.ones(n_outputs, device=x.device, requires_grad=True)
y = fxn(xd.view(1,n_inputs)).view(n_outputs)
if y.size(0) != n_outputs:
raise ValueError('Function `fxn` does not give output '
'compatible with `n_outputs`=%d, size '
'of fxn(x) : %s'
'' % (n_outputs, y.size(0)))
vjp = torch.autograd.grad(y, xd, grad_outputs=v,
create_graph=True,
retain_graph=retain_graph)[0]
assert vjp.shape == (n_inputs,)
# TODO somehow the repeat trick does not work anymore
# now that we have to take derivatives wrt v
# so loop over basis vectors and compose jacobian col by col
I = torch.eye(n_inputs, device=x.device)
J = []
for i in range(n_inputs):
Ji = autograd.grad(vjp, v,
grad_outputs=I[i],
retain_graph=retain_graph,
create_graph=True, # for higher order derivatives
)
J.append(Ji[0])
return torch.stack(J).t()
Looping over each column of the final Jacobian is slow, of course, and wasteful – I’m computing the Jacobian N times, and throwing most of it away each time. Since this is the bottleneck in my code, it would be sweet to do it all in one go .
Hoping someone with more knowledge of the guts of autograd can advise on the proper use here. Thanks in advance!!
PS. I won’t post just yet for sake of brevity, but if anyone is interested I can post a test case as well to work from.