I want to compute Jacobian matrices using PyTorch's autograd. Autograd natively computes vector-Jacobian products, so I'd simply like to pass an identity matrix as the vector to obtain the full Jacobian (i.e., IJ = J).
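To make that concrete, here's a toy sketch (not my real code) using a linear map, where the Jacobian is just the weight matrix, so the result is easy to check:

```python
import torch

W = torch.randn(3, 3)
x = torch.randn(3)

xr = x.repeat(3, 1)            # one copy of x per output dimension
xr.requires_grad_(True)
y = xr @ W.t()                 # row i of y is f(copy i); the true Jacobian of f is W
I = torch.eye(3)
(J,) = torch.autograd.grad(y, xr, grad_outputs=I)
print(torch.allclose(J, W))    # True: the full Jacobian from a single backward call
```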
One wrinkle: I'd like to implement both the standard reverse-mode AD computation of the Jacobian and a forward-mode version (which should be faster for most of my applications), using the trick due to Jamie Townsend: https://j-towns.github.io/2017/06/12/A-new-trick.html
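In a nutshell, the trick composes two reverse-mode passes: the VJP vᵀJ is linear in v, so differentiating it again with respect to v yields a JVP. A minimal sketch on a throwaway toy function (again, not my real code):

```python
import torch

def f(x):
    return torch.stack([x[0] * x[1], x[0] + x[1], x[1] ** 2])

x = torch.randn(2, requires_grad=True)
u = torch.tensor([1.0, 0.0])           # direction in input space

v = torch.ones(3, requires_grad=True)  # dummy grad_outputs we differentiate through
y = f(x)
(vjp,) = torch.autograd.grad(y, x, grad_outputs=v, create_graph=True)  # J^T v, linear in v
(jvp,) = torch.autograd.grad(vjp, v, grad_outputs=u)                   # J u, the first column of J
# for this f, jvp should equal [x[1], 1, 0]
```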
I've actually gotten the former (reverse-mode) working, using a few hints found online and some fairly straightforward work:
```python
import torch
from torch import autograd


def rev_jacobian(fxn, x, n_outputs, retain_graph):
    """
    the basic idea is to create N copies of the input
    and then ask for each of the N dimensions of the
    output... this allows us to compute J with pytorch's
    vector-Jacobian engine
    """

    # expand the input, one copy per output dimension
    n_outputs = int(n_outputs)
    repeat_arg = (n_outputs,) + (1,) * len(x.size())
    xr = x.repeat(*repeat_arg)
    xr.requires_grad_(True)

    # both y and I are shape (n_outputs, n_outputs)
    # checking y shape lets us report something meaningful
    y = fxn(xr).view(n_outputs, -1)
    if y.size(1) != n_outputs:
        raise ValueError('Function `fxn` does not give output '
                         'compatible with `n_outputs`=%d, size '
                         'of fxn(x) : %s'
                         '' % (n_outputs, y.size(1)))

    I = torch.eye(n_outputs, device=xr.device)

    # note: autograd.grad returns a tuple; its first element
    # holds the (n_outputs, n_inputs) Jacobian
    J = autograd.grad(y, xr,
                      grad_outputs=I,
                      retain_graph=retain_graph,
                      create_graph=True,  # for higher order derivatives
                      )

    return J
```
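For reference, here's the kind of toy check I've been running it against (a hypothetical two-dimensional map with a known analytic Jacobian, not part of my actual code):

```python
import math
import torch

def f(x):
    # acts row-wise on the repeated input, as rev_jacobian expects
    return torch.stack([x[..., 0] * x[..., 1], x[..., 0].sin()], dim=-1)

x = torch.tensor([0.3, -1.2])
J = rev_jacobian(f, x, n_outputs=2, retain_graph=True)[0]

J_analytic = torch.tensor([[-1.2, 0.3],
                           [math.cos(0.3), 0.0]])
print(torch.allclose(J, J_analytic))   # True
```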
However, for the forward-mode version I can't seem to get autograd to behave. Somehow I cannot get it to accept the identity matrix as grad_outputs without returning a sum over the columns of the Jacobian, rather than the full Jacobian I'm after. But given the success of the reverse-mode implementation, it seems like it should be easy!
Here’s the current code, with a commented out version of what I’d like to do:
```python
def fwd_jacobian(fxn, x, n_outputs, retain_graph):
    """
    This implementation is very similar to the above, but with one
    twist. To implement forward-mode AD with reverse-mode calls, we
    first compute the reverse-mode VJP for one (arbitrary) vector v,
    then we call d/dv(VJP) once per input basis vector (`n_inputs`
    times) to obtain the Jacobian column by column. This should be
    faster if `n_outputs` > `n_inputs`.

    References
    ----------
    .. [1] https://j-towns.github.io/2017/06/12/A-new-trick.html
           (Thanks to Jamie Townsend for this awesome trick!)
    """

    xd = x.detach().requires_grad_(True)
    n_inputs = int(xd.size(0))

    # first, compute *any* VJP -- the value of v does not matter,
    # we only need the graph connecting the VJP back to v
    v = torch.ones(n_outputs, device=x.device, requires_grad=True)

    y = fxn(xd.view(1, n_inputs)).view(n_outputs)
    if y.size(0) != n_outputs:
        raise ValueError('Function `fxn` does not give output '
                         'compatible with `n_outputs`=%d, size '
                         'of fxn(x) : %s'
                         '' % (n_outputs, y.size(0)))

    # autograd.grad returns a tuple; keep the single tensor
    vjp = torch.autograd.grad(y, xd, grad_outputs=v,
                              create_graph=True,
                              retain_graph=retain_graph)[0]
    assert vjp.shape == (n_inputs,)

    # TODO somehow the repeat trick does not work anymore
    #      now that we have to take derivatives wrt v,
    #      so loop over basis vectors and compose the Jacobian col by col
    I = torch.eye(n_inputs, device=x.device)
    J = []
    for i in range(n_inputs):
        Ji = autograd.grad(vjp, v,
                           grad_outputs=I[i],
                           retain_graph=retain_graph,
                           create_graph=True,  # for higher order derivatives
                           )[0]
        J.append(Ji)

    return torch.stack(J).t()
```
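For completeness, this is how I'm calling it (the same hypothetical toy map as in the reverse-mode check above, just to illustrate the interface); the looped version does seem to produce the right Jacobian, it's just slow:

```python
import torch

def f(x):
    # same toy map as above
    return torch.stack([x[..., 0] * x[..., 1], x[..., 0].sin()], dim=-1)

x = torch.tensor([0.3, -1.2])
J_fwd = fwd_jacobian(f, x, n_outputs=2, retain_graph=True)
J_rev = rev_jacobian(f, x, n_outputs=2, retain_graph=True)[0]
print(torch.allclose(J_fwd, J_rev))    # True, but fwd_jacobian loops over columns
```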
Looping over each column of the final Jacobian like this is slow, of course, and wasteful: each call to grad pays for a full backward pass and keeps only a single column. Since this is the bottleneck in my code, it would be sweet to do it all in one go.
Hoping someone with more knowledge of the guts of autograd can advise on the proper use here. Thanks in advance!!
PS. For the sake of brevity I haven't included it here, but if anyone is interested I can post a test case as well to work from.