I am facing a scenario with a dynamics model in an RL setting. A simple MLP represents the dynamics model, and we need to predict the rollout distribution. Concretely:

Step 0: Randomly draw 10 samples (initial states)

Iteration t = 1 to T:

(1) Feed the samples into the MLP to obtain 10 outputs

(2) Compute the mean and standard deviation of the outputs (i.e. fit a Gaussian), then draw 10 new samples (discarding the old ones) from `torch.normal(mean, std)`

The dynamics model shares its parameters across all iterations.

Given some differentiable cost function `cost(x)`, the objective (loss function) is the sum of the average costs of the samples at each iteration, i.e.

`J = mean(cost(x_0)) + ... + mean(cost(x_T))`

where `x_t` is the set of 10 samples at iteration `t`.
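To make the setup concrete, here is a minimal sketch of the rollout and objective described above. The `state_dim`, `mlp`, and `cost` definitions are placeholders I made up for illustration; the real model and cost come from my actual setup:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and dynamics model (placeholders for illustration).
state_dim = 4
mlp = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))

def cost(x):
    # Hypothetical differentiable cost, e.g. squared distance to the origin.
    return (x ** 2).sum(dim=1)

T = 5
x = torch.randn(10, state_dim)   # Step 0: randomly draw 10 initial states
J = cost(x).mean()               # mean(cost(x_0))

for t in range(1, T + 1):
    out = mlp(x)                 # (1) feed the 10 samples through the shared MLP
    mean = out.mean(dim=0)       # (2) fit a Gaussian over the 10 outputs...
    std = out.std(dim=0)
    # ...and draw 10 fresh samples, discarding the old ones.
    # Note: torch.normal is a sampling op, so the graph is cut here --
    # this is exactly where my backpropagation question arises.
    x = torch.normal(mean.expand(10, state_dim), std.expand(10, state_dim))
    J = J + cost(x).mean()       # accumulate mean(cost(x_t))
```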

I am a bit confused about how to handle the backpropagation with `.reinforce()`, since this is a different scenario from policy gradient, where the analytic gradient is the log-probability of the action multiplied by the reward, so that the action (a `Variable`) calls `.reinforce(r)`.