# Pytorch categorical distribution, probably a bug?

Hi everyone,

I tried to use `torch.distributions.Categorical` and do not understand why these two ways of computing the loss and gradient do not deliver identical results; only the losses are equal:

```python
import torch

inp = torch.tensor([[2 / 7, 4 / 7, 1 / 7]], requires_grad=True)

for a in range(3):
    action = torch.tensor([a])
    m = torch.distributions.Categorical(probs=inp)
    loss = -m.log_prob(action)
    loss.backward()
    print(loss, inp.grad)
    inp.grad = None  # reset so gradients don't accumulate across iterations

for a in range(3):
    action = torch.tensor([a])
    loss = torch.nn.NLLLoss()(inp.log(), action)
    loss.backward()
    print(loss, inp.grad)
    inp.grad = None
```

Output:
```
tensor([1.2528], grad_fn=) tensor([[-2.5000, 1.0000, 1.0000]])
tensor([0.5596], grad_fn=) tensor([[ 1.0000, -0.7500, 1.0000]])
tensor([1.9459], grad_fn=) tensor([[ 1.0000, 1.0000, -6.0000]])
tensor(1.2528, grad_fn=) tensor([[-3.5000, 0.0000, 0.0000]])
tensor(0.5596, grad_fn=) tensor([[ 0.0000, -1.7500, 0.0000]])
tensor(1.9459, grad_fn=) tensor([[ 0.0000, 0.0000, -7.0000]])
```

I would prefer to be a fool, or is this a bug?

Best regards,
Roman


Great observation!
It's a bit more subtle than a bug. `Categorical.log_prob` returns a gradient such that infinitesimally moving in that direction would leave you in the realm of probability measures, while `NLLLoss`'s backward has a constant offset that will happily catapult you outside the admissible input space.
In math lingo, `Categorical.log_prob` computes the gradient on the (not quite a manifold, but locally one if no probs are zero) manifold of probability vectors, and that gradient lies in the tangent space of that "manifold", while `NLLLoss` computes the gradient on the larger parameter space of the formula given in the documentation. The difference is orthogonal to the tangent space of probabilities.
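This is easy to check numerically. The following snippet (mine, not from the thread) computes both gradients for each action and shows that they differ by a constant offset, which for probabilities summing to one is exactly `-1` in every component:

```python
import torch

p = torch.tensor([[2 / 7, 4 / 7, 1 / 7]], requires_grad=True)

for a in range(3):
    action = torch.tensor([a])

    # gradient of -Categorical.log_prob w.r.t. the probabilities
    loss_cat = -torch.distributions.Categorical(probs=p).log_prob(action).sum()
    g_cat, = torch.autograd.grad(loss_cat, p)

    # gradient of NLLLoss on the raw log-probabilities
    loss_nll = torch.nn.NLLLoss()(p.log(), action)
    g_nll, = torch.autograd.grad(loss_nll, p)

    print(g_nll - g_cat)  # a constant vector of -1s for every action
```

The constant vector `[-1, -1, -1]` is orthogonal to the zero-sum tangent directions of the simplex, which is the "difference orthogonal to the tangent space" described above.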
The effect will go away if you force the input to `NLLLoss` to lie in the log-of-probability-measures manifold, i.e. change the line to

```python
    loss = torch.nn.NLLLoss()( inp.log().log_softmax(1), action )
```

Then the `log_softmax` does not change your input (it is a projection onto log probability measures, so the identity on them), but its backward will do the projection onto the tangent space (if you liked the math: because the normal component is in the kernel of the Jacobian), and you'll get gradients for `inp` that might better match your expectation.
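A quick check of this fix (again my own snippet, not from the thread): with the extra `log_softmax`, the `NLLLoss` gradient matches the `Categorical.log_prob` gradient:

```python
import torch

p = torch.tensor([[2 / 7, 4 / 7, 1 / 7]], requires_grad=True)
action = torch.tensor([1])

# gradient via Categorical.log_prob
loss_cat = -torch.distributions.Categorical(probs=p).log_prob(action).sum()
g_cat, = torch.autograd.grad(loss_cat, p)

# gradient via NLLLoss with the log_softmax projection inserted
loss_nll = torch.nn.NLLLoss()(p.log().log_softmax(1), action)
g_nll, = torch.autograd.grad(loss_nll, p)

print(g_cat)
print(g_nll)  # identical (up to floating-point noise)
```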

Best regards

Thomas


Thanks Thomas, great explanation!

So I am a fool, at least a bit of one.

After I woke up from a few hours of sleep and analyzed the source code, I also had a suspicion
where my error was, but you made me see it clearly.

Allow me to note that the sum of the components of the first gradients is also not equal to zero.
The shift of 1 between the two groups can be explained by `log_softmax()`.

The grads are orthogonal to `[ 2 / 7, 4. / 7, 1 / 7 ]`, and that is the point.
The prob manifold is a simplex, so the grads would have to have a zero sum of components
to let me stay on the manifold.

What do you mean?


Good catch! The difference must be constant (well-defined after de-meaning, aka projecting onto the tangent space). Now, if PyTorch doesn't do this by default, I wonder if training would work better if it did. Some of those reports on the forum… You can do this via `register_hook` on the input; I'll certainly try that when I have the time.
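The hook isn't spelled out here, so the following is only one possible reading (my own sketch, assuming "de-meaning" means subtracting the mean so the gradient lands in the zero-sum tangent space of the simplex):

```python
import torch

inp = torch.tensor([[2 / 7, 4 / 7, 1 / 7]], requires_grad=True)

# Hypothetical de-meaning hook: project the incoming gradient onto the
# zero-sum tangent directions of the simplex.
inp.register_hook(lambda g: g - g.mean(dim=1, keepdim=True))

loss = torch.nn.NLLLoss()(inp.log(), torch.tensor([0]))
loss.backward()
print(inp.grad)  # components now sum to zero
```

Note that this de-meaned gradient is not identical to the one from `Categorical.log_prob`, which, as discussed above, is orthogonal to the probability vector rather than zero-sum.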

Best regards

Thomas