Hi, my issue is that I use a set of different batches to compute one (multinomial) distribution and one sample per batch, within a single epoch. How can I backpropagate through those different distributions? I'm thinking of creating a single Multinomial that holds all the distribution vectors as a matrix, and drawing all the sample vectors at once (again as a matrix). Does that make sense?

Let me check that I understand: you want to learn a policy composed of N (supposedly) independent multinomial distributions.

So, something like \pi(a_i = k | s) ~ p_i^k (1-p_i)^k up to a normalization for i=1…N

And you want to learn it as a single distribution, \pi(A=[k_1..k_N] | s), right?

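In that case, assuming all the distributions are independent, the log-probability of the joint action is the sum of the individual log-probabilities. So you can sum the log_prob of all the distributions and backpropagate this sum (times the reward). As long as all the parameters are given to the optimizer, it should work.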
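
Here is a rough sketch of what I mean, in case it helps. I am assuming each "multinomial" is a single draw, so I use torch.distributions.Categorical with a matrix of logits (one row per distribution), which is the batched version you describe. The sizes N, K, S, the Linear layer standing in for the policy, and the constant reward are all made up for illustration; your network and reward will differ.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical sizes: N independent distributions, K categories each, state size S.
N, K, S = 4, 5, 3

# Stand-in policy network (an assumption, not the original poster's model).
policy = torch.nn.Linear(S, N * K)
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

state = torch.randn(S)
logits = policy(state).view(N, K)   # one row of logits per distribution

# A single Categorical with batch shape (N,): N independent distributions.
dist = Categorical(logits=logits)
actions = dist.sample()             # shape (N,), one sample per distribution

# Independence: log pi(A | s) = sum_i log pi(a_i | s)
log_prob = dist.log_prob(actions).sum()

reward = 1.0                        # placeholder; would come from the environment
loss = -reward * log_prob           # minus sign because the optimizer minimizes

optimizer.zero_grad()
loss.backward()                     # gradients flow through all N distributions
optimizer.step()
```

Note the minus sign in the loss: the optimizer minimizes, while you want to increase the log-probability of rewarded actions.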