In my model, I predict the mean and variance of a normal distribution. To get the input at the next time step, I need to sample from the normal distribution predicted previously. If I use `torch.normal`, when I call `backward`, does it backprop through this operation?

Hello @vvanirudh,

no (unless you are doing REINFORCE & co). That is why the standard thing is to sample from a standard normal (`torch.randn`), multiply by the standard deviation, and add the mean.

This is sometimes labeled the *reparametrisation trick*, see Kingma & Welling, Section 2.4.
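As a minimal sketch of what this looks like in code (the tensors `mean` and `log_std` are made-up stand-ins for whatever your model predicts, and the squared loss is only there to have something to differentiate):

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the mean and (log) standard deviation a model would predict.
mean = torch.tensor([0.5, -1.0], requires_grad=True)
log_std = torch.tensor([0.0, 0.3], requires_grad=True)
std = log_std.exp()

# Reparametrised sample: the noise itself carries no gradient,
# but the sample is a differentiable function of mean and std.
eps = torch.randn(2)
sample = mean + std * eps

# Illustrative loss on the sample.
loss = (sample ** 2).sum()
loss.backward()

print(mean.grad)     # equals 2 * sample
print(log_std.grad)  # equals 2 * sample * eps * std
```

If your PyTorch version has `torch.distributions`, `torch.distributions.Normal(mean, std).rsample()` draws a sample in the same reparametrised way, whereas `.sample()` does not propagate gradients.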

Best regards

Thomas

Hi @tom

In case of using `torch.normal`, are there reasonable values to pass to `.reinforce(reward)`? E.g. the gradients (backpropagated) from the next time step as the reward for each sample in the current time step.

Concretely, say we have 50 initial samples. Passing them through the NN, we get 50 outputs, from which we compute the mean and std. We then generate 50 new samples, feed these new samples through the NN again, and compute a loss.

Can anyone give a quick explanation as to why we can't backpropagate through a sample?

If you consider loss minimization, you can consider the (negative) gradient as a hint: "go this way with your parameters a tiny bit to reduce the loss". Now if you can apply the reparametrisation trick, you can say that tweaking the parameters a little bit also moves the output by a small amount. This relationship enables you to backpropagate.

However, if you have, say, a Bernoulli random variable (i.e. 1 with probability p, 0 with probability 1-p), you cannot "wiggle" the outcomes: they'll be 0 or 1.

The basic trick there is to recognize that you are interested in the expectation rather than an individual sample, then to figure out how to exchange expectation and derivative for backpropagation, and to write the result as an expectation again. This is the key to REINFORCE-type methods.
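To make this concrete, here is a rough sketch for the Bernoulli case, assuming a PyTorch version that has `torch.distributions`. The `reward` function and the single `logit` parameter are made up for illustration. The samples carry no gradient, but weighting `log_prob` by the reward gives a surrogate whose gradient estimates the gradient of the expected reward:

```python
import torch

torch.manual_seed(0)

# Hypothetical setup: one learnable logit, so p = sigmoid(logit) = 0.5 here.
logit = torch.tensor(0.0, requires_grad=True)
dist = torch.distributions.Bernoulli(logits=logit)

def reward(x):
    # Made-up reward: 3 for outcome 1, 1 for outcome 0, so E[R] = 1 + 2p.
    return 1.0 + 2.0 * x

# The samples are 0/1 and carry no gradient.
x = dist.sample((100_000,))

# Score-function surrogate: its gradient is the sample mean of
# R(x) * d/dlogit log p(x), an estimate of d/dlogit E[R(x)].
surrogate = (reward(x) * dist.log_prob(x)).mean()
surrogate.backward()

p = torch.sigmoid(logit).item()
exact_grad = 2.0 * p * (1.0 - p)  # d/dlogit (1 + 2p) = 2 p (1 - p)
print(logit.grad.item(), exact_grad)
```

The estimate should land close to the exact value, but it is noisy, which is why these methods usually need many samples or some variance reduction.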

Best regards

Thomas

@tom I feel like it's kind of off-topic,

but could you elaborate on the notion that

Is it because the objective in RL is usually to find an action/policy/or-some-sort that maximizes the expected reward?

Yes.

So in gross simplification, we want to optimize `E(R(a))`, the expected (`E`) reward (`R`) that is a function of our action (`a`). This is similar to wishing to optimize the loss (except it is minimization), where we want to optimize `E(L(p))`, the expected (`E`) loss (`L`) of a prediction (`p`) in supervised learning.

Now in supervised learning, the loss and prediction are typically framed as something continuous, either a distance (in regression) or a score/probability (in classification), and you can differentiate `E(L(p))` with respect to `p`. In the RL problems tackled with REINFORCE-type methods, the actions are typically discrete and thus non-differentiable. But you do have the ability to move the distribution of the actions, thus changing the "density" implicit in the expectation `E`. That is then differentiable.
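Written out for discrete actions, this is the usual log-derivative identity. The notation is my own assumption here: $\pi_\theta(a)$ is the probability of action $a$ under parameters $\theta$.

$$
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}[R(a)]
= \nabla_\theta \sum_a \pi_\theta(a)\, R(a)
= \sum_a \pi_\theta(a)\, R(a)\, \nabla_\theta \log \pi_\theta(a)
= \mathbb{E}_{a \sim \pi_\theta}\big[ R(a)\, \nabla_\theta \log \pi_\theta(a) \big]
$$

The reward $R(a)$ is never differentiated, only the probabilities are, and the last expression is again an expectation that you can estimate from samples.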

Similar things are done elsewhere, too, e.g. in Mathematical Finance: if you do Monte-Carlo pricing and want to estimate sensitivities, you can sometimes do path-wise derivatives (i.e. differentiate the integrand in the expectation) and sometimes you have to use the "likelihood ratio method" (which is differentiating the density to get a derivative of the expectation). Off the top of my head, Glasserman, Monte Carlo Methods in Financial Engineering, treats this in chapter 7.
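A toy comparison of the two estimators, not taken from Glasserman: the integrand `f`, the normal distribution, and the values of `mu` and `sigma` are all made up for illustration. Both estimate the derivative of `E[f(X)]` with respect to `mu` for `X ~ N(mu, sigma^2)`, which for `f(x) = x**2` is exactly `2 * mu`:

```python
import torch

torch.manual_seed(0)

# Hypothetical parameters and sample count.
mu, sigma, n = 1.0, 0.5, 200_000

z = torch.randn(n)
x = mu + sigma * z  # X ~ N(mu, sigma^2)

def f(x):
    # Toy integrand; E[f(X)] = mu^2 + sigma^2.
    return x ** 2

# Path-wise: differentiate the integrand along the path.
# dX/dmu = 1, so d f(X)/dmu = 2 * X.
pathwise = (2.0 * x).mean().item()

# Likelihood ratio: keep f(X) as is and multiply by
# d/dmu log density(X) = (X - mu) / sigma^2.
likelihood_ratio = (f(x) * (x - mu) / sigma ** 2).mean().item()

exact = 2.0 * mu
print(exact, pathwise, likelihood_ratio)
```

Both should come out near the exact value. In this toy case the likelihood-ratio estimate is noticeably noisier, but it is the one that still works when the integrand cannot be differentiated.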

Best regards

Thomas