I have a question about how to correctly interpret the sign of gradients computed during inference, when the dependent variable is the token probability after the softmax layer (not the loss) and the variable of interest is an intermediate activation in the Transformer's FFN layers.
Here is my calculation setup:
Model: llama-style Transformer
Activation: the intermediate activation output of the FFN
(Note, just to be clear: because of grad_outputs = torch.ones_like(probs), you are
computing the gradient of the scalar probs.sum() with respect to activations.)
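To make that concrete, here is a minimal sketch. The names (acts, W, probs) are illustrative stand-ins for your FFN activations and downstream projection, not the actual llama computation; the point is only that backpropagating ones through probs gives the same gradient as differentiating probs.sum():

```python
import torch

torch.manual_seed(0)
# Illustrative stand-ins: `acts` plays the role of the FFN intermediate
# activations, `W` a projection producing logits.
acts = torch.randn(4, 8, requires_grad=True)
W = torch.randn(8, 5)
probs = torch.softmax(acts @ W, dim=-1)

# Gradient obtained by backpropagating ones through probs ...
(g_ones,) = torch.autograd.grad(
    probs, acts, grad_outputs=torch.ones_like(probs), retain_graph=True
)

# ... is the gradient of the scalar probs.sum()
(g_sum,) = torch.autograd.grad(probs.sum(), acts)

print(torch.allclose(g_ones, g_sum))  # True
```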
If probs are computed from activations using properly differentiable functions (such
as built-in PyTorch functions), then autograd will compute correct gradients, up to some
reasonable round-off error (which might be amplified if the computation is somehow
ill-conditioned).
You should be able to reproduce autograd’s gradients numerically as you describe – that is,
by observing the (small) change in one of the probs produced by a (small) change in one of
the activations. The signs of the activations play no special role here – they’re just part
of the overall values of the activations.
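Here is a sketch of such a check on a toy stand-in (W, acts, and prob0 are illustrative, not your model): nudge one activation element and compare the observed change in one of the probs against autograd's derivative.

```python
import torch

torch.manual_seed(0)
# Toy stand-in for the real model: a single softmax over projected activations.
W = torch.randn(8, 5, dtype=torch.float64)

def prob0(a):
    # "one of the probs": the first softmax output
    return torch.softmax(a @ W, dim=-1)[0]

acts = torch.randn(8, dtype=torch.float64, requires_grad=True)
(grad,) = torch.autograd.grad(prob0(acts), acts)   # autograd's answer

# Numerical probe: nudge activation k, watch the change in probs[0]
k, eps = 3, 1e-6
with torch.no_grad():
    bumped = acts.clone()
    bumped[k] += eps
    fd = (prob0(bumped) - prob0(acts)) / eps

print(float(grad[k]), float(fd))  # the two should agree to several digits
```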
However, such “numerical differentiation” can be a bit delicate. The change you make to activations has to be small enough that you probe mostly just the first derivative of probs
with respect to activations without picking up too much of higher-order derivatives. But
the change you make has to be large enough that the change produced in probs is
comfortably larger than round-off error.
Watch how your numerical estimate of the derivative varies as you decrease the size of the
change you make to activations. You should see your estimate start to converge to the
correct first-order derivative (for example, as computed by autograd) as the contribution of
higher-order derivatives diminishes, but then get “noisy” and become incorrect as the round-off
error in the finite difference starts to dominate.
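You can watch this happen with an eps sweep on the same toy stand-in (names are illustrative): the error shrinks as eps shrinks, until round-off in the finite difference takes over at very small eps.

```python
import torch

torch.manual_seed(0)
# Same illustrative setup: probs[0] as a function of the activations.
W = torch.randn(8, 5, dtype=torch.float64)
acts = torch.randn(8, dtype=torch.float64)

def prob0(a):
    return torch.softmax(a @ W, dim=-1)[0]

a = acts.clone().requires_grad_()
exact = float(torch.autograd.grad(prob0(a), a)[0][3])

errors = {}
for eps in (1e-2, 1e-4, 1e-6, 1e-9, 1e-12, 1e-14):
    bumped = acts.clone()
    bumped[3] += eps
    fd = float((prob0(bumped) - prob0(acts)) / eps)
    errors[eps] = abs(fd - exact)
    print(f"eps={eps:.0e}  estimate={fd:+.10f}  |error|={errors[eps]:.2e}")
```

Typically the error bottoms out somewhere in the middle of the sweep and then grows again as eps approaches machine precision.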
(As an aside, you can cancel out the contribution of the second derivative to your numerical
estimate of the first derivative by computing f(x + eps/2) - f(x - eps/2), rather
than f(x + eps) - f(x).)
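On the same illustrative toy problem, the central difference is noticeably more accurate than the one-sided difference at the same eps:

```python
import torch

torch.manual_seed(0)
# Same illustrative setup: probs[0] as a function of the activations.
W = torch.randn(8, 5, dtype=torch.float64)
acts = torch.randn(8, dtype=torch.float64)

def prob0(a):
    return torch.softmax(a @ W, dim=-1)[0]

a = acts.clone().requires_grad_()
exact = float(torch.autograd.grad(prob0(a), a)[0][3])

eps = 1e-5
e = torch.zeros_like(acts)
e[3] = eps

# one-sided difference: error is O(eps), dominated by the second derivative
forward = float((prob0(acts + e) - prob0(acts)) / eps)
# central difference: second-derivative terms cancel, error is O(eps**2)
central = float((prob0(acts + e / 2) - prob0(acts - e / 2)) / eps)

print(abs(forward - exact), abs(central - exact))
```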
Such numerical differentiation would be best performed in double (or higher) precision.