import torch as T
import torch.nn.functional as F

f = T.tensor([0.8, 0.5, 0.4])          # raw network outputs (logits)
f2 = F.softmax(f, dim=-1)              # probabilities that sum to 1
print(f2)
action_probs = T.distributions.Categorical(f2)
print(action_probs)
action = action_probs.sample()         # draws an index according to f2
print(action)

I can't understand the logic of T.distributions.Categorical. The neural network gives me three actions, softmax turns them into probabilities, and in theory I would just choose the maximum (argmax). But I can't understand why:

I don't see the connection between the output of the neural network and action = action_probs.sample(). If the network gives the maximum probability to action 0, but sample() returns 1, what is the point?

The point of the distributions package is to actually sample from a distribution, not to return the element with the maximum probability.
If you only want the most likely element, use argmax (or tensor.max). If you want to draw an action according to the probabilities you computed, use distributions.
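To make the difference concrete, here is a minimal pure-Python sketch of what Categorical(f2).sample() does conceptually (the logits are copied from the question; the softmax and weighted sampling are spelled out by hand for illustration):

```python
import math
import random
from collections import Counter

random.seed(0)

# Same logits as in the question.
logits = [0.8, 0.5, 0.4]

# Softmax by hand: exponentiate and normalise.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]  # roughly [0.41, 0.31, 0.28]

# argmax always returns index 0 ...
argmax_action = max(range(len(probs)), key=lambda i: probs[i])

# ... whereas sampling returns each index with its probability,
# which is what Categorical.sample() does on each call.
counts = Counter(random.choices(range(len(probs)), weights=probs, k=10_000))
freqs = [counts[i] / 10_000 for i in range(len(probs))]

print(argmax_action)  # 0
print(freqs)          # close to [0.41, 0.31, 0.28]
```

So action 0 is still the most frequent outcome, but actions 1 and 2 also get chosen, in proportion to their probabilities. That's why a single sample() call can return 1 even though 0 has the highest probability.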

You should look into the REINFORCE algorithm for this.
The gist is that the policy-gradient objective is the expected reward under the probabilities your network computed.
If you always take the argmax instead of sampling, you are no longer acting according to that distribution, so you don't get an unbiased estimate of the gradient (and you never explore the lower-probability actions).
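A minimal sketch of how sampling feeds into the REINFORCE update, reusing the logits from the question (the reward value here is a placeholder, not something from the original post):

```python
import torch as T
import torch.nn.functional as F

# Treat the network output as logits, with gradients enabled.
logits = T.tensor([0.8, 0.5, 0.4], requires_grad=True)
probs = F.softmax(logits, dim=-1)
dist = T.distributions.Categorical(probs)

action = dist.sample()   # sampled, not argmax
reward = 1.0             # placeholder return from the environment

# REINFORCE loss: -log pi(action) * reward.
# Because the action was sampled from pi, this gives an unbiased
# estimate of the policy gradient; an argmax action would not.
loss = -dist.log_prob(action) * reward
loss.backward()
print(logits.grad)       # gradient flows back to all three logits
```

Note that log_prob(action) is differentiable with respect to the logits even though sample() itself is not; that is exactly the trick REINFORCE relies on.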

Well, logically, if you think about it: the network outputs values and I take the maximum. But if I use action = m.sample(), it's not clear to me from what distribution the action is drawn.