I am implementing the REINFORCE algorithm (as briefly explained over here) and sample actions from a Categorical distribution (or a Multinomial distribution, as argued in the explanation of the Categorical distribution).
I would like to know what the backpropagation rule looks like when plugging a Multinomial distribution into the REINFORCE update procedure.
More specifically, in Williams' terms, I would like to know what the characteristic eligibility looks like when actions are sampled from a Multinomial distribution during REINFORCE training. In Williams' paper I saw what the eligibility associated with a Bernoulli distribution looks like, but I don't know how to generalize it to more than two outcomes.
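To make the question concrete, here is the Bernoulli case as I understand it from Williams' paper, followed by the quantity I am missing. The symbols $K$, $a$ and $\theta$ in the second part are my own labels and do not come from the paper.

```latex
% Bernoulli unit i with output y_i \in \{0, 1\} and parameter p_i (Williams):
g_i(y_i, p_i) =
  \begin{cases}
    1 - p_i & \text{if } y_i = 0 \\
    p_i     & \text{if } y_i = 1
  \end{cases}
\qquad
\frac{\partial \ln g_i}{\partial p_i} = \frac{y_i - p_i}{p_i (1 - p_i)}

% With a logistic squashing function, p_i = f(s_i), s_i = \sum_j w_{ij} x_j:
e_{ij} = \frac{\partial \ln g_i}{\partial w_{ij}} = (y_i - p_i)\, x_j

% What I am asking about: K outcomes, sampled action a \in \{1, \dots, K\},
% probabilities p_1, \dots, p_K with \sum_k p_k = 1, produced by parameters \theta:
g(a, p) = p_a
\qquad
\frac{\partial \ln g}{\partial \theta} = \; ?
```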
From the explanation of distributions in PyTorch, I know that all of this is implemented by default in PyTorch. However, I would still like to know how it works in theory, and whether there are any papers describing the backpropagation procedure I am interested in.
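For reference, the quantity can at least be estimated numerically without relying on autograd. The sketch below is only an illustration under my own assumptions: the action probabilities come from a softmax over made-up logits, and the helper names (`softmax`, `log_prob`, `numerical_eligibility`) are hypothetical, not part of PyTorch. It approximates the gradient of the log-probability of the sampled action with central finite differences, so any closed-form candidate for the eligibility could be compared against it.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (assumed parameterization)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob(logits, action):
    """Log-probability of the chosen action under the softmax policy."""
    return math.log(softmax(logits)[action])


def numerical_eligibility(logits, action, eps=1e-6):
    """Central finite-difference estimate of d log p(action) / d logits."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grads.append((log_prob(up, action) - log_prob(down, action)) / (2 * eps))
    return grads


random.seed(0)
logits = [0.5, -1.0, 2.0]  # made-up values, for illustration only
probs = softmax(logits)

# Sample one action from the categorical distribution defined by probs.
action = random.choices(range(len(logits)), weights=probs)[0]

eligibility = numerical_eligibility(logits, action)
print("sampled action:", action)
print("numerical d log p / d logits:", eligibility)
```

As far as I understand, the PyTorch counterpart would be to build `torch.distributions.Categorical(logits=...)`, call `log_prob(action)` on it and then `backward()`; what I am after is the closed-form expression behind that gradient.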
Thank you very much in advance!