I am working with a dynamics model in an RL setting. The dynamics model is a simple MLP, and I need it to predict a rollout distribution. Concretely:
Step 0: Randomly draw 10 samples (initial states).
For each iteration t = 1 to T:
(1) Feed the samples into the MLP to obtain 10 outputs.
(2) Compute the mean and standard deviation of the outputs (i.e. fit a Gaussian), then draw 10 new samples from torch.normal(mean, std), discarding the old ones.
The dynamics model shares its parameters across all iterations.
Given some differentiable cost function cost(x), the objective (loss function) is the sum of the average costs of the samples at each iteration, i.e.

J = mean(cost(x_0)) + ... + mean(cost(x_T))

where x_t is the set of 10 samples at iteration t.
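To make the setup concrete, here is a minimal sketch of the rollout and objective described above. The names `mlp`, `cost`, the state dimension, and T are my own placeholders, not from my real code. It uses `torch.distributions.Normal(...).rsample()` (the reparameterization trick) at the sampling step, which is one possible way to keep the whole chain differentiable:

```python
import torch
import torch.nn as nn

# Placeholder dynamics model and cost; dimensions and T are assumptions.
state_dim, T = 4, 5
mlp = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))

def cost(x):
    # Any differentiable per-sample cost works; quadratic as a stand-in.
    return (x ** 2).sum(dim=-1)

x = torch.randn(10, state_dim)          # Step 0: 10 random initial states
J = cost(x).mean()                      # mean(cost(x_0))
for t in range(1, T + 1):
    out = mlp(x)                        # (1) feed the 10 samples through the MLP
    mean, std = out.mean(dim=0), out.std(dim=0)   # (2) fit a Gaussian over the batch
    dist = torch.distributions.Normal(mean, std)
    x = dist.rsample((10,))             # reparameterized sample: gradients flow
    J = J + cost(x).mean()              # accumulate mean(cost(x_t))

J.backward()                            # gradients reach mlp's parameters
```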
I am a bit confused about how to handle backpropagation with .reinforce(), since this is a different scenario from policy gradient, where the analytic gradient is the log-probability of the action multiplied by the reward, so the action (of type Variable) calls .reinforce(r).
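For comparison, here is a sketch of the score-function (REINFORCE-style) gradient written out explicitly via `log_prob`, which is how `torch.distributions` expresses what `.reinforce()` used to do. Again `mlp`, `cost`, the dimensions, and T are placeholders. The one subtlety in this multi-step setting is credit assignment: each sampling step should be weighted by the cost-to-go (all costs it influenced), analogous to the return in policy gradients:

```python
import torch
import torch.nn as nn

# Placeholder model, cost, and sizes; all names here are assumptions.
state_dim, T = 4, 3
mlp = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, state_dim))

def cost(x):
    return (x ** 2).sum(dim=-1)           # any differentiable cost works

x = torch.randn(10, state_dim)
log_probs, costs = [], [cost(x).mean()]   # costs[0] = mean(cost(x_0))
for t in range(T):
    out = mlp(x)
    dist = torch.distributions.Normal(out.mean(0), out.std(0))
    x = dist.sample((10,))                # .sample() is non-differentiable
    log_probs.append(dist.log_prob(x).sum(-1))   # (10,) per-sample log-prob
    costs.append(cost(x).mean())

# Score-function surrogate: weight each step's log-prob by the (detached)
# cost-to-go, i.e. the sum of all costs incurred from that step onward.
ctg = [sum(costs[t + 1:]) for t in range(T)]
surrogate = sum((lp * ctg[t].detach()).mean() for t, lp in enumerate(log_probs))
surrogate.backward()                      # gradient estimate w.r.t. mlp's parameters
```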