I am working with a dynamics model in an RL setting. A simple MLP represents the dynamics model, and we need to predict the rollout distribution. Concretely,
Step 0: Randomly draw 10 samples (initial states)
Iteration t = 1 to T:
(1) Feed the samples into the MLP to obtain 10 outputs
(2) Compute the mean and standard deviation of the outputs (i.e. fit a Gaussian), then sample 10 new results from the fitted Gaussian (discarding the old ones)
The dynamics model shares its parameters across all iterations.
Given some differentiable cost function cost(x), the objective (loss function) is the sum of the average costs of the samples in each iteration, i.e.
J = mean(cost(x_0)) + ... + mean(cost(x_T))
where x_t is the set of 10 samples at iteration t.
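To make the setup concrete, here is a minimal sketch of the rollout and the objective J in PyTorch. Everything named here is an assumption for illustration: the sizes (STATE_DIM, N_SAMPLES, T), the MLP architecture, and the quadratic cost are placeholders. It also swaps in reparameterized sampling (Normal.rsample from torch.distributions) so that J is differentiable with respect to the shared MLP parameters; that is one possible way to get gradients here, not necessarily the intended one, and plain .sample() would return values detached from the graph.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, N_SAMPLES, T = 3, 10, 5

# Hypothetical dynamics model: a single MLP reused at every iteration,
# so its parameters are shared across the whole rollout.
dynamics = nn.Sequential(
    nn.Linear(STATE_DIM, 32),
    nn.Tanh(),
    nn.Linear(32, STATE_DIM),
)

def cost(x):
    # Placeholder differentiable cost: squared distance to the origin.
    return (x ** 2).sum(dim=-1)

def rollout_objective(model, x0, horizon):
    """Return J = mean(cost(x_0)) + ... + mean(cost(x_T))."""
    x = x0
    J = cost(x).mean()
    for _ in range(horizon):
        out = model(x)                  # (1) feed the samples through the MLP
        mean = out.mean(dim=0)          # (2) fit a diagonal Gaussian ...
        std = out.std(dim=0) + 1e-6     #     (small epsilon keeps std positive)
        # ... and draw fresh samples, discarding the old ones.
        # Assumption: rsample() draws mean + std * eps, which keeps the graph.
        x = Normal(mean, std).rsample((x0.shape[0],))
        J = J + cost(x).mean()
    return J

x0 = torch.randn(N_SAMPLES, STATE_DIM)  # Step 0: random initial states
J = rollout_objective(dynamics, x0, T)
J.backward()                            # gradients reach the shared parameters
```

After J.backward(), each parameter of dynamics has a populated .grad, accumulated over all T iterations because the same module is applied at every step.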
I am a bit confused about how to handle backpropagation with .reinforce(), since this is a different scenario from policy gradient, where the analytic gradient is the log-probability of the action multiplied by the reward, so that the action of type