I tried to use torch.distributions.Categorical and do not understand why these two methods of calculating the loss and gradient do not give identical results. Only the losses are equal:

import torch

inp = torch.tensor( [[ 2 / 7, 4. / 7, 1 / 7 ]], requires_grad = True )

# Method 1: negative log-probability from Categorical
for a in range( 3 ):
    action = torch.tensor( [a] )
    m = torch.distributions.Categorical( probs=inp )
    loss = -m.log_prob( action )
    loss.backward()
    print( loss, inp.grad )
    inp.grad.zero_()  # reset the accumulated gradient before the next action

# Method 2: NLLLoss applied to the log of the same probabilities
for a in range( 3 ):
    action = torch.tensor( [a] )
    loss = torch.nn.NLLLoss()( inp.log(), action )
    loss.backward()
    print( loss, inp.grad )
    inp.grad.zero_()

Great observation!
It's a bit more subtle than a bug. Categorical.log_prob returns a gradient with sum zero, which would leave you in the realm of probability measures when moving infinitesimally in that direction, while NLLLoss's backward has a constant offset that will happily catapult you outside the admissible input space.
In math lingo, Categorical.log_prob computes the gradient on the manifold of probability vectors (not quite a manifold, but locally one if no probs are zero), so the gradient lies in the tangent space of that "manifold", while NLLLoss computes the gradient on the larger parameter space of the formula given in the documentation. The difference is orthogonal to the tangent space of probabilities.
The effect will go away if you force the input to NLLLoss to lie in the log-of-probability-measures manifold, i.e. change the line to

loss = torch.nn.NLLLoss()( inp.log().log_softmax(1), action )

Then the log_softmax does not change your input (it is a projection onto log probability measures, so it is the identity on them), but its backward will do the projection onto the tangent space (if you liked the math: because the normal component is in the kernel of the Jacobian), and you'll get gradients for inp that might better match your expectation.
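To check the suggestion above, here is a minimal sketch that compares the three gradients side by side for the probabilities from the question. The helper names (grad_of, categorical_loss, plain_nll_loss, projected_nll_loss) and the grads dictionary are made up for this illustration and are not part of PyTorch.

```python
import torch

# Same probabilities as in the question.
inp = torch.tensor([[2 / 7, 4 / 7, 1 / 7]], requires_grad=True)

def categorical_loss(action):
    # .sum() turns the shape-[1] result into a scalar for autograd.grad.
    return -torch.distributions.Categorical(probs=inp).log_prob(action).sum()

def plain_nll_loss(action):
    return torch.nn.NLLLoss()(inp.log(), action)

def projected_nll_loss(action):
    # The suggested fix: pass the log-probabilities through log_softmax first.
    return torch.nn.NLLLoss()(inp.log().log_softmax(1), action)

def grad_of(loss_fn, a):
    """Hypothetical helper: d(loss)/d(inp) for class index a."""
    (g,) = torch.autograd.grad(loss_fn(torch.tensor([a])), inp)
    return g

# grads[a] = (Categorical, plain NLLLoss, NLLLoss with log_softmax)
grads = {}
for a in range(3):
    grads[a] = (
        grad_of(categorical_loss, a),
        grad_of(plain_nll_loss, a),
        grad_of(projected_nll_loss, a),
    )
    print(a, *grads[a])
```

If the reasoning above holds, the first and third gradients should agree for every action, and the plain NLLLoss gradient should differ from them by the same constant in each component.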

When I woke up after a few hours of sleep and analyzed the source code, I also had a suspicion
about where my error was, but you made me see it clearly.

Allow me to point out that the sum of the first gradients is also not equal to zero.
The shift of 1 between the two groups can be explained by log_softmax(),
but what about the nonzero sums?

The grads are orthogonal to [ 2 / 7, 4. / 7, 1 / 7 ], and that is the point.
The probability manifold is a simplex, so the grads have to have components that sum to zero
to let me stay in the manifold.
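The orthogonality and the nonzero sums can be worked out by hand. This is a sketch that relies on the documented behaviour that Categorical normalizes probs by their sum. The symbols p, S, a and n are introduced here for illustration: p is the input vector, S its sum, a the chosen class and n the number of classes.

```latex
% Categorical: log-probability of the normalized vector p / S, with S = \sum_j p_j
L_{\mathrm{cat}}(p) = -\log\frac{p_a}{S},
\qquad
\frac{\partial L_{\mathrm{cat}}}{\partial p_i} = -\frac{\delta_{ia}}{p_a} + \frac{1}{S}

% NLLLoss on \log p: no normalization term
L_{\mathrm{nll}}(p) = -\log p_a,
\qquad
\frac{\partial L_{\mathrm{nll}}}{\partial p_i} = -\frac{\delta_{ia}}{p_a}

% At S = 1 the two gradients differ by 1 in every component.
% The Categorical gradient is orthogonal to p:
\sum_i p_i \frac{\partial L_{\mathrm{cat}}}{\partial p_i} = -1 + \frac{S}{S} = 0

% but its components do not sum to zero in general:
\sum_i \frac{\partial L_{\mathrm{cat}}}{\partial p_i} = \frac{n}{S} - \frac{1}{p_a}
```

For the numbers in the question (n = 3, S = 1, a = 0, p_a = 2/7) the component sum is 3 - 3.5 = -0.5, while the inner product with p vanishes.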

Good catch! The difference must be constant (to be defined after de-meaning aka projecting onto the tangent space). Now, if PyTorch doesn't do this by default, I wonder if training would be better if it did. Some of those reports on the forum... You can do this via register_hook on the input, I'll certainly try that when I have the time.
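A rough sketch of the register_hook idea, assuming that "de-meaning" means subtracting the mean over the class dimension. The helper names demean and hooked_grad are hypothetical. Whether this helps training is not something the sketch shows. It only illustrates that the two gradients coincide once the constant offset is removed.

```python
import torch

def demean(grad):
    # Subtract the mean over the class dimension so the components sum to zero.
    return grad - grad.mean(dim=-1, keepdim=True)

def categorical_loss(inp, action):
    return -torch.distributions.Categorical(probs=inp).log_prob(action).sum()

def nll_loss(inp, action):
    return torch.nn.NLLLoss()(inp.log(), action)

def hooked_grad(loss_fn, a):
    """Hypothetical helper: gradient w.r.t. a fresh input with the hook attached."""
    inp = torch.tensor([[2 / 7, 4 / 7, 1 / 7]], requires_grad=True)
    # The value returned by the hook replaces the gradient stored in inp.grad.
    inp.register_hook(demean)
    loss_fn(inp, torch.tensor([a])).backward()
    return inp.grad

for a in range(3):
    print(a, hooked_grad(categorical_loss, a), hooked_grad(nll_loss, a))
```

Because the two raw gradients differ only by a constant in every component, both should print the same de-meaned gradient for each action, with components summing to zero up to rounding.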