Vanilla REINFORCE for continuous distributions

I’m writing a baseline for my model based on REINFORCE. That means I expect it not to work very well, which makes it difficult to check whether my implementation is correct. I tried this simple script to check that I’ve understood how to do REINFORCE in PyTorch.

It trains an MLP to produce 4 simple curves (identity, square, cube and sin) on a 1D input. The output consists of 4 means and 4 standard deviations, together defining 4 1D Gaussians. I sample an output vector from these distributions and apply REINFORCE to get a loss.

My question is simply: is this the standard way to apply REINFORCE for Normal distributions, and to distribute the loss over the batch? It seems to work for this simple example, but I need to make sure that I’m not crippling my baseline by misunderstanding something.

import torch
from torch import nn
from torch.autograd import Variable
import torch.nn.functional as F

batch = 64
iterations = 50000

# Two layer MLP, producing means and sigmas for the output
h = 128
model = nn.Sequential(
    nn.Linear(1, h), nn.Sigmoid(),
    nn.Linear(h, 8)  # 4 means followed by 4 log-sigmas
)

opt = torch.optim.Adam(model.parameters(), lr = 0.0005)

for i in range(iterations):

    x = torch.randn(batch, 1)
    y = torch.cat([x, x ** 2, x ** 3, torch.sin(x)], dim=1)

    x, y = Variable(x), Variable(y)

    res = model(x)
    means, sigs = res[:, :4], torch.exp(res[:, 4:])

    dists = torch.distributions.Normal(means, sigs)
    samples = dists.sample()

    mloss = F.mse_loss(samples, y, reduction='none')
    loss = - dists.log_prob(samples) * - mloss
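    loss = loss.mean()

    # Backpropagate the surrogate loss and take an optimizer step
    opt.zero_grad()
    loss.backward()
    opt.step()
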
    if i % 1000 == 0:
        print('{: 6} grad'.format(i), list(model.parameters())[0].grad.mean())
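        # Arguments were missing in the original listing; the per-output MSE of the samples is assumed
        print('      ', 'loss', F.mse_loss(samples, y, reduction='none').mean(dim=0))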
        print('      ', 'sigs', sigs.mean(dim=0))

I can’t recall what REINFORCE is. Could you point me to a reference for that method?

It’s the basic method behind policy gradient reinforcement learning. It’s referenced in the docs here:
This blogpost provides a more elaborate explanation:

It’s normally used in explicit reinforcement learning settings, when your network produces a distribution over possible actions and the environment provides you with a reward for the action. This is also how it is referenced in the docs. However, it can be used for any model that uses some non-differentiable sampling step. The principle is generalized in this paper.
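For reference, the identity behind this is usually written as below. The symbols are my own notation rather than anything from the thread: $p_\theta$ is the distribution produced by the network, $z$ is the sample, and $f$ is whatever is computed on the sample (a reward or a negative loss).

```latex
\nabla_\theta \, \mathbb{E}_{z \sim p_\theta}\left[ f(z) \right]
  = \mathbb{E}_{z \sim p_\theta}\left[ f(z) \, \nabla_\theta \log p_\theta(z) \right]
```

Note that $f$ never has to be differentiated, which is why the sampling step is allowed to be non-differentiable.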

My example is a little artificial: the model produces a distribution on the outputs and then samples from that distribution, computes a loss on the sample, and then estimates a gradient on that sample using REINFORCE.
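To make that last step concrete, here is a minimal sketch in plain Python (no PyTorch) for a single Gaussian with a learnable mean. The values of `mu`, `sigma` and `target` are illustrative assumptions, not taken from the script above. It compares the REINFORCE estimate of the gradient with the analytic one, which is available here because the expected squared error has a closed form.

```python
import random

random.seed(0)

# Illustrative values (assumptions, not from the thread)
mu, sigma, target = 0.0, 1.0, 1.0
n = 200_000

total = 0.0
for _ in range(n):
    z = random.gauss(mu, sigma)      # sampling step; nothing is differentiated through it
    f = (z - target) ** 2            # loss computed on the sample
    score = (z - mu) / sigma ** 2    # d/dmu of log N(z; mu, sigma^2)
    total += f * score

estimate = total / n                 # REINFORCE estimate of d/dmu E[f(z)]
analytic = 2.0 * (mu - target)       # because E[f(z)] = (mu - target)^2 + sigma^2
print(estimate, analytic)
```

The two numbers should agree closely but not exactly, since the estimator is unbiased but noisy. That noise is the main reason a REINFORCE baseline tends to train slowly.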

Thanks for the references. Could you frame your problem in standard RL terms? I can’t figure out what the state, action and reward are in your example.

If you were to map this to an RL setting, then an “action” consists of choosing a real-valued 4D vector. The model produces a distribution over these actions in the form of 4 independent 1D Normal distributions, from which a single action is sampled.

The “reward” is just the negative of the loss, which is the MSE between the target (y) and the sampled action (samples).
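Read literally, that mapping can be sketched as follows. This is only an illustration: `means`, `sigs` and `y` are hypothetical stand-ins for the model outputs and targets, and treating each 4D action as a unit (summing log-probabilities and squared errors over the 4 dimensions) is my reading of the mapping, not something the script does.

```python
import torch

torch.manual_seed(0)

batch = 8
# Hypothetical stand-ins for the model outputs and the targets
means = torch.zeros(batch, 4, requires_grad=True)
sigs = torch.ones(batch, 4)
y = torch.randn(batch, 4)

dists = torch.distributions.Normal(means, sigs)
actions = dists.sample()  # one 4D action per input; sample() carries no gradient

# Log-probability of the whole action: the 4 dimensions are independent, so the logs add up
log_prob = dists.log_prob(actions).sum(dim=1)   # shape (batch,)

# One scalar reward per action: negative squared error summed over the 4 outputs
reward = -((actions - y) ** 2).sum(dim=1)       # shape (batch,)

# REINFORCE surrogate: minimizing it performs gradient ascent on the expected reward
loss = -(log_prob * reward).mean()
loss.backward()
```

The script in the question instead multiplies per-dimension log-probabilities by per-dimension losses. As far as I can tell, because the four Gaussians are independent and the loss is a sum of per-dimension terms, both versions have the same expected gradient up to a constant factor, and the per-dimension form should be the less noisy of the two.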