Can I backpropagate different distributions at once using Policy Gradient?

Ignasi_Mas · January 21, 2019, 7:02pm

Hi, my issue is that, within a single epoch, I use a set of different batches and compute one (multinomial) distribution and one sample for each batch. How can I backpropagate through those different distributions? I'm thinking of creating a single Multinomial that holds all the distribution vectors as a matrix, and drawing all the samples at once (again as a matrix). Does that make sense?

Thank you in advance.

alexis-jacq · January 31, 2019, 6:57pm

I am trying to understand: you want to learn a policy composed of N (supposedly) independent multinomial distributions.

So, something like \pi(a_i = k | s) ~ p_i^k (1-p_i)^k up to a normalization for i=1…N

And you want to learn it as a single distribution: \pi(A = [k_1, …, k_N] | s), right?

In that case, assuming all the distributions are independent, the joint log-probability is the sum of the individual ones, so you can sum the log_prob of all the distributions and backpropagate this sum (times the reward). As long as all the parameters are given to the optimizer, it should work.
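
To make the idea concrete, here is a minimal sketch, assuming PyTorch and `torch.distributions`. The sizes `N`, `obs_dim` and `K`, the linear `policy`, the random `states` and the constant `reward` are all hypothetical placeholders, not taken from the original question. I also use `Categorical` rather than `Multinomial` so that each sample is a single index; a matrix of logits with one row per distribution gives a batch of N independent distributions, which seems to be what the question describes.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical sizes: N independent distributions, K categories each.
N, obs_dim, K = 4, 8, 5

# Hypothetical stand-in for the real policy network.
policy = torch.nn.Linear(obs_dim, K)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

# One row per batch/distribution (random placeholder inputs).
states = torch.randn(N, obs_dim)
logits = policy(states)                  # shape (N, K)

# A (N, K) matrix of logits defines N independent categoricals.
dist = Categorical(logits=logits)
actions = dist.sample()                  # shape (N,), one sample per row
log_probs = dist.log_prob(actions)       # shape (N,)

# Placeholder scalar reward (assumption: one reward for the joint action).
reward = torch.tensor(1.0)

# Independence: joint log-prob is the sum of the individual log-probs.
# The minus sign turns reward maximization into a loss to minimize.
loss = -(log_probs.sum() * reward)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

If each distribution gets its own reward, a `rewards` tensor of shape `(N,)` could replace the scalar, with the loss written as `-(log_probs * rewards).sum()`.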