Backpropagation through sampling a normal distribution

vvanirudh · May 17, 2017, 5:56pm

In my model, I predict the mean and variance of a normal distribution. To get the input at the next time-step, I need to sample from the normal distribution predicted previously. If I use torch.normal, when I call backward does it backprop through this operation?

tom · May 17, 2017, 9:42pm

Hello @vvanirudh,

no (unless you are using REINFORCE-type methods). That is why the standard approach is to sample from a standard normal (torch.randn), multiply by the standard deviation, and add the mean.
This is sometimes labeled the reparametrisation trick, see Kingma & Welling, Section 2.4.
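
A minimal sketch of that recipe, assuming the network outputs a mean and a log-variance. The tensors `mean` and `log_var` and the squared loss are hypothetical placeholders standing in for the real model:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the network outputs (assumed, not from the thread).
mean = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, 0.2], requires_grad=True)

std = torch.exp(0.5 * log_var)   # standard deviation from the log-variance
eps = torch.randn_like(mean)     # standard normal noise, independent of the parameters
sample = mean + std * eps        # differentiable with respect to mean and std

loss = (sample ** 2).sum()       # placeholder loss on the sample
loss.backward()                  # gradients reach mean and log_var through the sample
```

The randomness lives entirely in `eps`, so the sample is an ordinary differentiable function of the predicted parameters. In more recent PyTorch versions, `torch.distributions.Normal(mean, std).rsample()` draws a sample in the same reparametrised way.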

Best regards

Thomas

zuoxingdong · September 19, 2017, 12:10am

Hi @tom

When using torch.normal, are there reasonable values to pass to .reinforce(reward)? For example, could the gradients backpropagated from the next time step serve as the reward for each sample in the current time step?

Concretely, say we have 50 initial samples. Passing them through the NN gives 50 outputs, from which we compute the mean and std. We then generate 50 new samples, feed these new samples through the NN again, and compute a loss.

beginner · April 5, 2018, 3:20am

Can anyone give a quick explanation as to why we can’t backpropagate through a sample?

tom · April 5, 2018, 9:31pm

If you consider loss minimization, you can consider the (negative) gradient as a hint “go this way with your parameters a tiny bit to reduce the loss”. Now if you can apply the reparametrisation trick, you can say that tweaking the parameters a little bit also moves the output by a small amount. This relationship enables you to backpropagate.
However, if you have, say, a Bernoulli random variable (i.e. 1 with probability p, 0 with probability 1-p), you cannot “wiggle” the outcomes - they’ll be 0 or 1.
The basic trick there is to recognize that you are interested in the expectation rather than an individual sample. You then figure out how to exchange expectation and derivative, and rewrite the result as an expectation again so that it can be estimated from samples - this is the key to the REINFORCE-type methods.
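
As an illustration of the Bernoulli case, here is a rough sketch of a score-function (REINFORCE-type) estimate. The single `logit` parameter and the function `f` are hypothetical, chosen only so that the exact answer is easy to check by hand:

```python
import torch

torch.manual_seed(0)

# Hypothetical setup: one logit parametrises p = sigmoid(logit) = 0.5.
logit = torch.tensor(0.0, requires_grad=True)
p = torch.sigmoid(logit)

def f(x):
    # Placeholder downstream loss of the discrete outcome: f(1) = 0, f(0) = 1.
    return (x - 1.0) ** 2

dist = torch.distributions.Bernoulli(probs=p)
x = dist.sample((100_000,))   # samples are 0 or 1; no gradient flows through them

# Surrogate whose gradient is a Monte Carlo estimate of d/dlogit E[f(x)]:
# the mean of f(x) * d/dlogit log p(x).
surrogate = (f(x).detach() * dist.log_prob(x)).mean()
surrogate.backward()

# Exact value for comparison: E[f(x)] = 1 - p, so d/dlogit E[f(x)] = -p * (1 - p).
exact = -(p * (1 - p)).item()
print(logit.grad.item(), exact)
```

The samples themselves are never differentiated. Only the log-probability of the observed outcomes is, which is why this works for discrete variables, typically at the price of a noisier estimate than the reparametrised one.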

Best regards

Thomas

haku · June 8, 2020, 12:38am

@tom I feel like it’s kind of off-topic,
but could you elaborate on the notion that

Is it because the objective in RL is usually to find an action/policy/or-some-sort that maximizes the expected reward?

tom · June 8, 2020, 8:29pm

Yes.
So in gross simplification, we want to optimize E(R(a)), the expected (E) reward (R) that is a function of our action (a). This is similar to wishing to optimize the loss (except it is minimization) where we want to optimize E(L(p)) the expected (E) loss (L) of a prediction (p) in supervised learning.
Now in supervised learning, the loss and prediction are typically framed as something continuous, either a distance (in regression) or a score/probability (in classification), and you can differentiate E(L(p)) by p. In the RL problems tackled with REINFORCE-type methods, the actions are typically discrete, so R(a) cannot be differentiated with respect to a. But you do have the ability to move the distribution of the actions, thus changing the “density” implicit in the expectation E. That is then differentiable.

Similar things are done elsewhere, too, e.g. in Mathematical Finance if you do Monte-Carlo pricing and want to estimate sensitivities, you can sometimes do path-wise derivatives (i.e. differentiate the integrand in the expectation) and sometimes you have to use the “Likelihood ratio method” (which is differentiating the density to get a derivative of the expectation). Off the top of my head, Glasserman, Monte Carlo Methods in Financial Engineering, treats this in chapter 7.
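
Written out, the two approaches look roughly as follows. The notation is an assumption made for this sketch and is not from the references: θ are the parameters, f is the loss or reward, g is the reparametrised sampler, and p_θ is the density. Both identities assume enough regularity to exchange derivative and expectation; for discrete variables the integral becomes a sum.

```latex
% Path-wise (reparametrised) derivative:
% x = g(\theta, \varepsilon), with noise \varepsilon independent of \theta
\nabla_\theta \, \mathbb{E}_{\varepsilon}\big[ f(g(\theta, \varepsilon)) \big]
  = \mathbb{E}_{\varepsilon}\big[ \nabla_\theta f(g(\theta, \varepsilon)) \big]

% Likelihood ratio (score function, REINFORCE-type) derivative:
% x \sim p_\theta, and f itself is never differentiated
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\big[ f(x) \big]
  = \int f(x) \, \nabla_\theta p_\theta(x) \, dx
  = \mathbb{E}_{x \sim p_\theta}\big[ f(x) \, \nabla_\theta \log p_\theta(x) \big]
```

The first form needs f and g to be differentiable. The second only needs the density to be differentiable in θ, which is why it still applies when x is discrete.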

Best regards

Thomas