Are there any examples of applying forward-mode AD in a real scenario? For example, training a 3-layer MLP on the MNIST data? I am surprised that I cannot find anything on this topic.
Even in the documentation's simple Linear example, I understand the output as the computed value of the derivative, but this is only for the final output. How do we get the intermediate calculations as well? I want to assign these to .grad and then use an optimizer.
Hi Ryan!
I am not aware of any (but there could be something somewhere).
The reason is that “real” use cases for forward-mode AD are quite specialized.
In standard backward-mode autograd, a single forward / backward pass computes the gradient of a single scalar (loss) (or, more generally, the vector-Jacobian product for a single vector in the output space) with respect to a bunch of parameters.
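As an illustrative sketch (my own toy example, with made-up shapes), one backward pass gives you the whole gradient at once, and grad_outputs gives the more general vector-Jacobian product:

```python
import torch

w = torch.randn(5, 3, requires_grad=True)   # "a bunch of parameters"
x = torch.randn(3)

loss = (w @ x).sum()                        # a single scalar loss
loss.backward()                             # one forward / backward pass
print(w.grad.shape)                         # torch.Size([5, 3]) -- the full gradient

# more generally, the vector-Jacobian product for one vector in the output space
y = w @ x                                   # non-scalar output
v = torch.randn(5)                          # vector in the output space
vjp, = torch.autograd.grad(y, w, grad_outputs=v)
```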
On the other hand, a single forward-mode forward pass computes the derivatives of a bunch of output values with respect to a single upstream value (or, more generally, the directional derivative of the output values with respect to a single direction in upstream parameter space – that is, the Jacobian-vector product for a single vector in upstream parameter space).
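Here is a matching sketch of a single forward-mode pass (again my own toy example; the function and shapes are made up, and I'm assuming a pytorch version that has torch.func.jvp – torch.autograd.forward_ad is the lower-level alternative):

```python
import torch
from torch.func import jvp

x = torch.randn(3)

def f(w):
    return w @ x                  # a "bunch of output values" (shape [5])

w = torch.randn(5, 3)             # the upstream parameters
tangent = torch.randn(5, 3)       # one direction in parameter space

# one forward-mode pass: the outputs plus their directional derivative
# (Jacobian-vector product) along `tangent`
out, jvp_out = jvp(f, (w,), (tangent,))
print(out.shape, jvp_out.shape)   # torch.Size([5]) torch.Size([5])
```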
Because you normally want the gradient of a single scalar loss with respect to all of your
trainable parameters, you normally use backward-mode autograd, which gives this to you
with one forward / backward pass. To get the full gradient with forward-mode autograd,
you would have to perform many forward-mode forward passes, one for each individual
scalar value making up the parameters with respect to which the gradient is being computed.
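Concretely (another made-up toy example), a one-hot tangent recovers exactly one component of the gradient, so you need as many forward-mode passes as there are scalar parameters:

```python
import torch
from torch.func import jvp

x = torch.randn(4)
w = torch.randn(4)

def loss_fn(w):
    return ((w * x).sum()) ** 2            # a single scalar loss

i = 2                                      # which gradient component we want
tangent = torch.zeros_like(w)
tangent[i] = 1.0                           # one-hot direction
_, grad_i = jvp(loss_fn, (w,), (tangent,)) # one forward-mode pass -> one component

# check against a single backward pass, which gives all components at once
w_req = w.clone().requires_grad_(True)
loss_fn(w_req).backward()
print(torch.allclose(grad_i, w_req.grad[i]))   # True
```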
To train a 3-layer multi-layer perceptron, you would typically compute a single scalar loss (most likely CrossEntropyLoss) and compute its gradient with respect to all of the model's parameters using a single forward / backward pass. Let's say your model consists of 1000 individual scalar parameters (these being the elements of the weight and bias tensors of the model's Linear layers). You would have to perform 1000 forward-mode passes to compute the full gradient – a vastly more expensive proposition.
You certainly could use forward mode to compute this gradient – at great cost – and then
follow the rest of the standard workflow, performing gradient-descent optimization to train
the model.
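If you really wanted to, such a scheme might look something like the following sketch (my own illustration – I'm assuming torch.func.functional_call and torch.autograd.forward_ad, and I use a single Linear layer and a made-up mse-style loss as a stand-in for your 3-layer MLP and CrossEntropyLoss):

```python
import torch
import torch.nn as nn
import torch.autograd.forward_ad as fwAD
from torch.func import functional_call

model = nn.Linear(3, 2)                         # stand-in for a small MLP
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, target = torch.randn(8, 3), torch.randn(8, 2)
params = {name: p.detach() for name, p in model.named_parameters()}

def loss_jvp(tangents):
    # one forward-mode pass: directional derivative of the loss along `tangents`
    with fwAD.dual_level():
        duals = {n: fwAD.make_dual(p, tangents[n]) for n, p in params.items()}
        out = functional_call(model, duals, (x,))
        loss = ((out - target) ** 2).mean()     # made-up scalar loss
        return fwAD.unpack_dual(loss).tangent

# one forward-mode pass per scalar parameter: a one-hot tangent picks out
# one entry of the gradient
for name, p in model.named_parameters():
    grad = torch.zeros_like(p)
    for idx in range(p.numel()):
        tangents = {n: torch.zeros_like(t) for n, t in params.items()}
        tangents[name].view(-1)[idx] = 1.0
        grad.view(-1)[idx] = loss_jvp(tangents)
    p.grad = grad                               # hand the gradient to the optimizer

opt.step()                                      # ordinary gradient-descent update
```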
But there is no practical reason to do things this way, so there is little motivation to write
examples that illustrate such a scheme.
Best.
K. Frank