About the sign of gradients from token probability w.r.t. intermediate activations during inference

I have a question about how to correctly interpret the sign of gradients computed during inference, when the dependent variable is the probability of a token after the softmax layer (not the loss) and the variable of interest is an intermediate activation in a Transformer FFN layer.

Here is my calculation setup:

Model: a Llama-style Transformer
Activation: the intermediate activation output of the FFN

Here is the code I use for the calculation:


import torch

# torch.autograd.grad returns a tuple, one entry per input;
# the gradient tensor itself is grads[0]
grads = torch.autograd.grad(
    outputs=probs,
    inputs=activations,
    grad_outputs=torch.ones_like(probs),
    retain_graph=True,
)

Here:

  • probs is the predicted probability of a specific target token

  • activations are intermediate activations (e.g., from an MLP up-projection layer output)

To check the gradients, I manually increase or decrease the activation values and observe the change in the target token probability.

Empirically, I find that:

  • Increasing an activation does not always have the same effect on the probability for neurons with different activation signs

  • The observed effect seems correlated with the sign of the activation, but it does not match the sign of the raw gradient in a simple one-to-one way

I am wondering:

  • Why might the activation sign influence the observed probability change, even though the gradient is computed with respect to additive changes?

  • Is this expected behavior when computing gradients of probabilities w.r.t. intermediate activations?

Any insight would be appreciated. Thanks!

Hi Xin!

(Note, just to be clear: because of grad_outputs=torch.ones_like(probs), you are
computing the gradient of the scalar probs.sum() with respect to activations.)

If probs are computed from activations using properly-differentiable functions (such
as built-in pytorch functions), then autograd will compute correct gradients, up to some
reasonable round-off error (that might be amplified if the computation is somehow
ill-conditioned).

You should be able to reproduce autograd’s gradients numerically as you describe – that is,
by observing the (small) change in one of the probs produced by a (small) change in one of
the activations. The signs of the activations play no special role here – they’re just part
of the overall values of the activations.
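As a concrete sketch, here is a minimal check using a toy stand-in for the activations-to-probability map (the matrix W and the helper token_prob are illustrative, not your model). Autograd's gradient should match the effect of a small additive perturbation to one coordinate of activations:

```python
import torch

# toy stand-in for activations -> logits -> softmax (W is illustrative, not your model)
torch.manual_seed(0)
W = torch.randn(4, 3, dtype=torch.float64)

def token_prob(activations):
    # probability of a specific "target token" (index 0) after softmax
    return torch.softmax(activations @ W, dim=-1)[0]

activations = torch.randn(4, dtype=torch.float64, requires_grad=True)
p = token_prob(activations)
(grad,) = torch.autograd.grad(p, activations)

# additively perturb one coordinate and compare against autograd
eps = 1e-6
delta = torch.zeros_like(activations)
delta[2] = eps
with torch.no_grad():
    fd = (token_prob(activations + delta) - p) / eps

print(grad[2].item(), fd.item())  # these should agree to several decimal places
```

The sign of activations[2] itself plays no special role here; the comparison works the same way whether that coordinate happens to be positive or negative.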

However, such “numerical differentiation” can be a bit delicate. The change you make to
activations has to be small enough that you probe mostly just the first derivative of probs
with respect to activations without picking up too much of higher-order derivatives. But
the change you make has to be large enough that the change produced in probs is
comfortably larger than round-off error.

Watch how your numerical estimate of the derivative varies as you decrease the size of the
change you make to activations. You should see your estimate start to converge to the
correct first-order derivative (for example, as computed by autograd) as the contribution of
higher-order derivatives diminishes, but then get “noisy” and become incorrect as the round-off
error in the finite difference starts to dominate.
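That sweep can be sketched like this (same toy stand-in as above; done in float32 so the round-off regime shows up at larger step sizes):

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3)  # float32, so round-off error shows up at moderate eps

def token_prob(x):
    # toy stand-in: probability of "target token" 0 after softmax
    return torch.softmax(x @ W, dim=-1)[0]

x = torch.randn(4, requires_grad=True)
(g,) = torch.autograd.grad(token_prob(x), x)

direction = torch.zeros_like(x)
direction[1] = 1.0  # perturb a single coordinate

estimates = {}
for eps in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]:
    with torch.no_grad():
        estimates[eps] = ((token_prob(x + eps * direction) - token_prob(x)) / eps).item()
    print(f"eps = {eps:.0e}   estimate = {estimates[eps]:+.6f}   autograd = {g[1].item():+.6f}")
```

At moderate eps the estimate approaches the autograd value; at very small eps (in float32) the finite difference falls near the resolution of the stored values and the estimate turns noisy.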

(As an aside, you can cancel out the contribution of the second derivative to your numerical
estimate of the first derivative by computing f(x + eps/2) - f(x - eps/2), rather
than f(x + eps) - f(x).)
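For example, still with the toy stand-in, comparing a forward difference against a central difference in double precision:

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3, dtype=torch.float64)

def token_prob(x):
    # toy stand-in: probability of "target token" 0 after softmax
    return torch.softmax(x @ W, dim=-1)[0]

x = torch.randn(4, dtype=torch.float64, requires_grad=True)
(g,) = torch.autograd.grad(token_prob(x), x)

d = torch.zeros_like(x)
d[0] = 1.0
eps = 1e-4
with torch.no_grad():
    # forward difference: error is O(eps)
    forward = (token_prob(x + eps * d) - token_prob(x)) / eps
    # central difference: the second-derivative term cancels, error is O(eps^2)
    central = (token_prob(x + 0.5 * eps * d) - token_prob(x - 0.5 * eps * d)) / eps

print("forward-difference error:", abs(forward - g[0]).item())
print("central-difference error:", abs(central - g[0]).item())
```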

Such numerical differentiation would be best performed in double (or higher) precision.

Best.

K. Frank