The question concerns the torch.distributions implementation. This is the canonical example from the release page:
probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
Usually, the probabilities are obtained from policy_network as a result of a softmax operation.
However, if we subsequently take the logarithm of the softmax probabilities, aren't we losing numerical precision?
And if so, why is the Categorical distribution not implemented in a way that allows constructing the object from logits alone?
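To make the precision concern concrete, here is a small pure-Python sketch (the `softmax` and `log_softmax` helpers are just for illustration, not the library's implementations). With widely separated logits, the softmax probability underflows to zero, so taking its log afterwards is impossible, while a fused log-softmax keeps the exact value:

```python
import math

def softmax(logits):
    # subtract the max for stability of the softmax itself
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    # fused, stable form: log_softmax(x) = x - max - log(sum(exp(x - max)))
    m = max(logits)
    lse = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - lse for x in logits]

logits = [0.0, 800.0]
print(softmax(logits)[0])      # 0.0 -- exp(-800) underflows in float64
# math.log(0.0) would raise ValueError, so log(softmax(x)) is lost here,
print(log_softmax(logits)[0])  # -800.0 -- while log_softmax is exact
```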
You should put a LogSoftmax in the policy network's last layer, not a Softmax.
If I do so, I will then have to call exp(log_probs) to match the signature of torch.distributions.Categorical.
Also, the subsequent call to torch.distributions.Categorical(probs).log_prob(0) will take the log again, which causes the same numerical issues.
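The round trip described above can be sketched in pure Python (illustrative only): once exp underflows a small log-probability to zero, taking the log again cannot recover it.

```python
import math

log_p = -800.0       # a perfectly representable log-probability
p = math.exp(log_p)  # underflows to 0.0 in float64
print(p)             # 0.0
# math.log(p) now raises ValueError: the original -800.0 is unrecoverable
```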
So, either I don't understand the purpose of the current implementation of torch.distributions, or the example on the GitHub release page is misleading.
Also, something seems off with sampling from Categorical. After executing
Hi, I think the log_prob comes from the policy optimization algorithm (i.e., it converts a product into a sum, as here). The output of the policy network should still be a distribution over your action space.
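For example (a pure-Python sketch, not the library's code), summing log-probabilities over a trajectory stays well behaved where multiplying the raw probabilities underflows:

```python
import math

# hypothetical probabilities of the actions taken along one long trajectory
step_probs = [0.1] * 400

# multiplying tiny probabilities underflows...
product = 1.0
for p in step_probs:
    product *= p
print(product)  # 0.0 -- 1e-400 is below the float64 range

# ...while summing log-probabilities keeps the exact magnitude
log_likelihood = sum(math.log(p) for p in step_probs)
print(log_likelihood)  # about -921.03
```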