Examine the following:
import torch
from torch.distributions import Beta
m = Beta(torch.tensor([5.0]), torch.tensor([5.0]))
print(m.entropy())  # differential entropy; negative for these parameters
For discrete distributions, one expects the sum of -p*log(p) to always be non-negative, whereas the entropy of a continuous distribution such as the Beta above can be negative.
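To make the contrast concrete, here is a minimal sketch comparing the two kinds of entropy. The four-action categorical distributions and the Beta(1, 1) case are illustrative choices of mine, not part of the original setup. A categorical entropy lies between 0 and log(K), while the differential entropy of a distribution on [0, 1] is at most 0 (reached by the uniform, i.e. Beta(1, 1)) and becomes negative as the density concentrates.

```python
import math
import torch
from torch.distributions import Beta, Categorical

# Discrete case: entropy is bounded between 0 and log(K) for K actions.
uniform_cat = Categorical(probs=torch.ones(4) / 4)
peaked_cat = Categorical(probs=torch.tensor([0.97, 0.01, 0.01, 0.01]))
cat_entropies = torch.stack([uniform_cat.entropy(), peaked_cat.entropy()])

# Continuous case: differential entropy has no lower bound of zero.
# Beta(1, 1) is the uniform distribution on [0, 1]; its entropy is 0.
flat_beta_entropy = Beta(torch.tensor([1.0]), torch.tensor([1.0])).entropy()
# Beta(5, 5) is more concentrated around 0.5; its entropy is negative.
beta_entropy = Beta(torch.tensor([5.0]), torch.tensor([5.0])).entropy()

print("categorical entropies:", cat_entropies)
print("Beta(1, 1) entropy:", flat_beta_entropy)
print("Beta(5, 5) entropy:", beta_entropy)
```

So a negative value is not a sign of a bug by itself. The ordering still behaves as in the discrete case: a more concentrated distribution has lower entropy, only the scale is shifted.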
Many RL algorithms use policies defined over discrete action spaces, and most (e.g., A2C, PPO) include an entropy term in their objective function.
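For reference, the entropy term usually enters the loss roughly as follows. This is only a sketch under assumptions: the batch size, the random logits and advantages, and the coefficient value 0.01 are placeholders I made up, and real A2C/PPO implementations add a value loss and (for PPO) a clipped ratio.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical stand-ins for a policy network's output and advantage estimates.
logits = torch.randn(8, 3, requires_grad=True)  # batch of 8 states, 3 actions
advantages = torch.randn(8)
entropy_beta = 0.01  # entropy coefficient (assumed value)

dist = Categorical(logits=logits)
actions = dist.sample()

# Policy-gradient term plus an entropy bonus scaled by entropy_beta.
policy_loss = -(dist.log_prob(actions) * advantages).mean()
entropy = dist.entropy().mean()
loss = policy_loss - entropy_beta * entropy

loss.backward()
print("entropy:", entropy.item(), "grad norm:", logits.grad.norm().item())
```

Subtracting the scaled entropy from the loss rewards higher-entropy policies, which is why the coefficient normally changes the solution.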
When entropy is positive, we expect D-RL to push the policy distribution towards one with lower entropy as a successful action distribution is discovered (assuming we start at maximum entropy).
Does this trend still hold during training when one uses continuous distributions (assuming there’s some kind of convergence)?
Mathematical intuition says that the answer to this question should be yes. However, are there any RL problems where this may not be guaranteed? I’m asking this question in the RL group simply out of curiosity.
(By the way, I’ve also noticed, and I may be wrong here, that the entropy beta has no impact on the solution for continuous action spaces in PyTorch when the action distribution is modeled with PyTorch’s Beta distribution. Could it be that the result of the Beta distribution’s entropy computation is already “detached”? Changing the value of the entropy beta seems to make no difference to the solution. When the same action space is discretized into a categorical, this is not the case, and the entropy beta does make a difference. I hope my use of “entropy beta” in this parenthesis is not confused with the Beta distribution or its entropy: the entropy beta is the multiplier of the action distribution’s entropy in the RL algorithm’s objective.)
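One way to test the "detached" hypothesis is to check directly whether gradients flow from the entropy back to the concentration parameters. The sketch below assumes the parameters are plain leaf tensors rather than network outputs, and the helper name `entropy_grad` and the two coefficient values are hypothetical.

```python
import torch
from torch.distributions import Beta

def entropy_grad(entropy_beta):
    """Return whether the entropy is attached to the graph, plus the
    gradients of the entropy bonus w.r.t. both concentration parameters."""
    a = torch.tensor([5.0], requires_grad=True)
    b = torch.tensor([5.0], requires_grad=True)
    entropy = Beta(a, b).entropy().sum()
    bonus_loss = -entropy_beta * entropy  # entropy part of the objective only
    bonus_loss.backward()
    return entropy.requires_grad, a.grad.clone(), b.grad.clone()

attached, grad_a_small, grad_b_small = entropy_grad(0.01)
_, grad_a_large, grad_b_large = entropy_grad(1.0)

print("entropy attached to graph:", attached)
print("grad wrt a (entropy_beta=0.01):", grad_a_small)
print("grad wrt a (entropy_beta=1.0): ", grad_a_large)
```

If the entropy were detached, `attached` would be False and `backward()` would raise an error. If instead the gradients come out non-zero and scale with the coefficient, the insensitivity seen in training would have to come from somewhere else in the training code, which this isolated check cannot diagnose.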