Explicitly Calculate Jacobian Matrix in Simple Neural Network

PyTorch provides the API torch.autograd.functional.jacobian to calculate the Jacobian matrix.
In algorithms like Levenberg-Marquardt, we need the first-order partial derivatives of the loss (a vector) with respect to each weight (1-D or 2-D) and bias.
With the jacobian function, we can easily get them:

torch.autograd.functional.jacobian(nn_func, inputs=inputs_tuple, vectorize=True)

It is fast, but vectorize=True requires a lot of memory. So I am wondering: is it possible to compute the first-order derivatives explicitly in PyTorch,
i.e., to calculate $\partial L / \partial w_i$ and $\partial L / \partial b_i$?
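
For concreteness, here is a minimal sketch (not from the original setup; the tiny linear residual function, sizes, and names are placeholders) of how the functional jacobian API returns one Jacobian per parameter:

import torch

x = torch.randn(8, 3)          # 8 samples, 3 features (placeholder data)
y = torch.randn(8)             # targets

def residuals(w, b):
    # vector-valued "loss": one residual per sample, shape (8,)
    return x @ w + b - y

w = torch.randn(3)
b = torch.randn(1)

# J_w has shape (8, 3) = d residual_i / d w_j, and J_b has shape (8, 1)
J_w, J_b = torch.autograd.functional.jacobian(residuals, (w, b), vectorize=True)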

I think the issue here is that you want a column of the Jacobian, but a single backward pass only computes a single row, forcing you to compute the entire Jacobian (which is obviously inefficient). In this case you should be using forward-mode AD, which more directly computes $\partial L / \partial w_i$ (where L is vector-valued) in a single forward pass.

The TL;DR about forward AD is that we basically run the forward pass as usual, except we associate a perturbation tensor with each tensor. We call the original tensor the "primal" and the perturbation the "tangent"; the pair as a whole we refer to as the "dual" tensor. The idea is that as we perform the forward pass as usual on the primal, we also perform the necessary computation on the tangent so that it in effect propagates the perturbation and produces the desired partial derivative, i.e. it computes "how sensitive are my outputs (to perturbations) relative to the inputs".

For example:

# not tested
import torch
import torch.autograd.forward_ad as fwAD

with fwAD.dual_level():
    tangent = torch.zeros_like(parameter)  # `parameter` is the primal here
    tangent[i] = 1  # set the value of the tangent at the desired index to 1
    dual_in = fwAD.make_dual(parameter, tangent)
    dual_out = model(dual_in)  # slightly more tricky if you have a module or more params
    primal_out, tangent_out = fwAD.unpack_dual(dual_out)

print(tangent_out)   # dL/dw_i
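
To expand on the "slightly more tricky if you have a module" comment above, here is a sketch of one way to push a tangent through an nn.Module, following the pattern from the PyTorch forward-mode AD tutorial of temporarily swapping each parameter for a dual tensor (the nn.Linear sizes and the choice of perturbed entry are just placeholders):

import torch
import torch.nn as nn
import torch.autograd.forward_ad as fwAD

model = nn.Linear(5, 5)
inp = torch.randn(16, 5)

params = {name: p for name, p in model.named_parameters()}
# A zero tangent leaves a parameter "fixed"; here we perturb a single weight entry.
tangents = {name: torch.zeros_like(p) for name, p in params.items()}
tangents["weight"][0, 0] = 1.0

with fwAD.dual_level():
    for name, p in params.items():
        delattr(model, name)                              # unregister the Parameter
        setattr(model, name, fwAD.make_dual(p, tangents[name]))
    out = model(inp)
    jvp = fwAD.unpack_dual(out).tangent                   # d(output)/d(weight[0, 0])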

The functionality for forward AD has only been added recently, and we are still working on improving operator coverage by adding more formulas, so if you run into an operator whose forward AD formula is not yet implemented, let me know so we can prioritize your use case.

Aside: vectorization is also possible with fwAD (it will be in master soon), allowing you to compute $\partial L / \partial w_i$ and $\partial L / \partial b_i$ in a single forward pass. It only computes the columns of the Jacobian that you need, vs. the backward-AD case where you need to materialize the entire Jacobian, so you shouldn't have to worry about memory here.
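
For reference, in later stable releases this vectorized forward-mode path is exposed through the same functional API via the strategy argument; a minimal sketch (assuming PyTorch >= 1.11, with a placeholder residual function):

import torch

x = torch.randn(8, 3)

def residuals(w, b):
    return x @ w + b

w, b = torch.randn(3), torch.randn(1)
# forward-mode strategy requires vectorize=True and computes one column per tangent
J_w, J_b = torch.autograd.functional.jacobian(
    residuals, (w, b), strategy="forward-mode", vectorize=True
)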

Hi, many thanks! I will go on to test the suggested method.

Hi @soulitzer,

I was wondering if it's at all possible to use forward-mode AD in 1.10.dev to calculate the Hessian of a function using forward-over-reverse AD? That is, computing the Jacobian using forward-mode AD and then using reverse-mode AD to get the Hessian (as JAX currently supports; see "The Autodiff Cookbook" in the JAX documentation).

Thank you!

It is technically possible now, depending on the operators your model uses. However, the real speed-up of a forward-over-reverse Hessian comes from being able to vectorize over the forward pass (otherwise you'd have to compute the forward pass O(numel) times). The ability to compute a vectorized JVP should be in master soon, but it is not ready at the moment.
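
For readers on newer versions: the composable transforms (functorch, later merged into core as torch.func in PyTorch 2.0) make forward-over-reverse straightforward; a sketch assuming torch.func is available and using a placeholder scalar function f:

import torch
from torch.func import grad, jvp, jacfwd, jacrev

def f(x):                       # placeholder scalar-valued function
    return (x ** 3).sum()

x = torch.randn(5)
v = torch.randn(5)

# Hessian-vector product: forward mode (jvp) over reverse mode (grad)
_, hvp = jvp(grad(f), (x,), (v,))

# Full Hessian via forward-over-reverse
H = jacfwd(jacrev(f))(x)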


Hi Soulitzer,

As 1.10 has now been fully released, I was wondering if there are any examples of how to calculate the Hessian of an nn.Module's output w.r.t. the input? Also, will this support batch dimensions as well?

I have a current example which uses a reverse-over-reverse calculation of the Laplacian (i.e., the sum of the diagonal of the Hessian); if I share it, would it be possible for you to direct me on how to change it to a forward-over-reverse calculation?

Thank you!
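
For context on that last point, a reverse-over-reverse Laplacian usually looks something like the sketch below (not the poster's actual code; f and the input shape are placeholders), where the per-coordinate second backward pass is the part a forward-over-reverse scheme would replace:

import torch

def f(x):                       # placeholder scalar-valued function
    return (x ** 3).sum()

x = torch.randn(5, requires_grad=True)
(g,) = torch.autograd.grad(f(x), x, create_graph=True)   # first (reverse-mode) pass

laplacian = 0.0
for i in range(x.numel()):
    # one extra backward pass per input coordinate: d g_i / d x_i
    (h_row,) = torch.autograd.grad(g[i], x, retain_graph=True)
    laplacian = laplacian + h_row[i]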