Policy Gradient using loss as reward

Given an input, I want to predict the right output using policy gradient. The input is a number between 0 and 9 and the output is also a number between 0 and 9. I defined the reward as -abs(input-output), which is effectively a negated loss, and I hoped that gradient ascent would make the numbers match, but unfortunately they don’t.
I’m using the pytorch REINFORCE implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.nn.utils as utils
from torch.autograd import Variable
import numpy as np

class Policy(nn.Module):
    def __init__(self, hidden_size, num_inputs, action_space):
        super(Policy, self).__init__()
        self.action_space = action_space
        num_outputs = action_space

        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, num_outputs)

    def forward(self, inputs):
        x = inputs
        x = F.relu(self.linear1(x))
        action_scores = self.linear2(x)
        # normalise over the action dimension (inputs have shape [batch, num_inputs])
        return F.softmax(action_scores, dim=1)

class REINFORCE:
    def __init__(self, hidden_size, num_inputs, action_space, gamma=0.99):
        self.action_space = action_space
        self.model = Policy(hidden_size, num_inputs, action_space)
        self.model = self.model.cuda()
        self.optimizer = optim.Adam(self.model.parameters(), lr=1e-3)
        self.gamma = gamma
        self.rewards = []
        self.log_probs = []
        self.entropies = []

    def select_action(self, state):
        probs = self.model(Variable(state).cuda())
        # num_samples is passed explicitly; newer PyTorch versions require it
        action = probs.multinomial(1).data.unsqueeze(0)
        prob = probs[:, action[0, 0]].view(1, -1)
        log_prob = prob.log()
        entropy = - (probs * probs.log()).sum()
        return action[0], log_prob, entropy

    def update_parameters(self):
        R = torch.zeros(1, 1)
        loss = 0
        for i in reversed(range(len(self.rewards))):
            R = self.gamma * R + self.rewards[i]
            loss = loss - (self.log_probs[i] * (Variable(R).expand_as(self.log_probs[i])).cuda()).sum() - (
                        0.0001 * self.entropies[i].cuda()).sum()
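        loss = loss / len(self.rewards)

        # backpropagate, clip the gradient norm, then take one optimizer step
        self.optimizer.zero_grad()
        loss.backward()
        utils.clip_grad_norm(self.model.parameters(), 40)
        self.optimizer.step()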
        self.rewards = []
        self.log_probs = []
        self.entropies = []

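    def add_experience(self, reward, log_prob, entropy):
        # store one step of experience for the next call to update_parameters
        self.rewards.append(reward)
        self.log_probs.append(log_prob)
        self.entropies.append(entropy)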

if __name__ == '__main__':
    reinforce = REINFORCE(128,1,10)
    while True:
        for i in range(10):
            input = torch.FloatTensor([i]).unsqueeze(0)
            output, log_prob, entropy = reinforce.select_action(input)
            reward = -torch.abs(output.type(torch.FloatTensor) - input)
            reinforce.add_experience(reward, log_prob, entropy)
            reinforce.update_parameters()

What am I doing wrong?
Thanks :)

So basically the network should learn the identity, i.e. a factor of 1 (input * 1 = output)?

Exactly. Of course, eventually I want to do something else; I’m using this example just to see whether the approach works (and it doesn’t).


Three things I noticed (but I’m not an expert in RL):

  • I would expect update_parameters to be called less often. Should this be moved outside the for loop? I seem to associate policy gradient with notoriously high variance, so deriving one update from multiple trials might work better.
  • I would clamp the probabilities away from zero before taking logs. I needed this to avoid the device asserts.
  • Are you sure the “R” formula does what you want here? For a bandit-type prediction, where every sample is an independent one-step episode, accumulating a discounted return across samples seems somewhat odd.

Changing these, the network appears to learn something (being mostly right or off only by 1) although the results won’t be perfect even after many iterations.
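
The second point (clamping before the log) can be illustrated without PyTorch. This is only a minimal sketch: the helper names `safe_log` and `entropy` are hypothetical, and the floor of 1e-6 is an assumption chosen for illustration. Plain Python raises an error on log(0), whereas in PyTorch the log of a zero probability is -inf and 0 * -inf is NaN, which then poisons the loss.

```python
import math

EPS = 1e-6  # assumed floor, playing the role of clamp(min=EPS) on a tensor

def safe_log(p, eps=EPS):
    """Log of a probability, floored at eps so that p == 0 does not blow up."""
    return math.log(max(p, eps))

def entropy(probs, eps=EPS):
    """Entropy -sum(p * log p), with the argument of the log clamped away from zero."""
    return -sum(p * safe_log(p, eps) for p in probs)

# A policy that has collapsed onto one action; the other probabilities are exactly 0.
collapsed = [1.0] + [0.0] * 9

print(safe_log(collapsed[1]))  # finite, equal to log(EPS)
print(entropy(collapsed))      # finite (zero up to sign) instead of NaN
```

Only the argument of the log is clamped, so the probabilities used for sampling stay untouched.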

Best regards


Thank you!
I tried changing the code according to your notes and I get the correct answer only 50%-60% of the time. Is that what you’re getting? If not, can you post the code after the changes here?
Btw, you’re right, there’s no need for the R formula in this case.
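
To make the point about the R formula concrete, here is a small pure-Python sketch (the function name `discounted_returns` is hypothetical). It applies the same recursion as update_parameters, R = gamma * R + reward, to a batch of independent samples. Once several samples are collected before an update, a correct prediction gets blamed for the errors made on later, unrelated inputs.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go R_i = r_i + gamma * R_{i+1}, accumulated from the last sample backwards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = gamma * running + r
        returns.append(running)
    returns.reverse()
    return returns

# Rewards of three independent predictions: the first is correct, the others are off by 3 and 6.
rewards = [0.0, -3.0, -6.0]

print(discounted_returns(rewards, gamma=0.99))  # first entry is about -8.85, not 0
print(discounted_returns(rewards, gamma=0.0))   # each sample keeps its own reward
```

Using R = self.rewards[i] directly corresponds to the gamma = 0 case, which matches a setting where every sample is its own one-step episode.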

Do you also get closer for the wrong ones? The reward suggests that one off is better than further away.

So here are the bits I changed:

  • The learning rate in the optimizer instantiation (I did not play much with it)
        self.optimizer = optim.Adam(self.model.parameters(), lr=3e-3)
  • in select_action the regularisation of the log
        log_prob = prob.clamp(min=1e-6).log()
        entropy = - (probs * probs.clamp(min=1e-6).log()).sum()
  • in update_parameters the “multiple independent episodes” reward (ideally, one would allow several multi-step episodes, but here every sample is independent)
            R = self.rewards[i]
  • and finally I changed the inner loop so that the parameters are updated once per 50 samples (note the % 10 to only have inputs 0…9), put in 10001 outer loops and a diagnostic print every 500 outer loops
    reinforce = REINFORCE(128,1,10)
    for j in range(10001):
        for i in range(50):
            input = torch.FloatTensor([i%10]).unsqueeze(0)
            output, log_prob, entropy = reinforce.select_action(input)
            reward = -torch.abs(output.type(torch.FloatTensor) - input)
            if j%500==0 and i < 10:
                print(j, input[0,0], output[0,0], reward[0,0])
            reinforce.add_experience(reward, log_prob, entropy)
        # one update per batch of 50 independent samples
        reinforce.update_parameters()


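With that, the diagnostic output (columns: outer loop, input, sampled output, reward) goes from the first block below, at outer loop 0, to the second, at outer loop 10000: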

0 0.0 0 -0.0
0 1.0 4 -3.0
0 2.0 8 -6.0
0 3.0 4 -1.0
0 4.0 4 -0.0
0 5.0 4 -1.0
0 6.0 8 -2.0
0 7.0 0 -7.0
0 8.0 4 -4.0
0 9.0 8 -1.0


10000 0.0 0 -0.0
10000 1.0 1 -0.0
10000 2.0 1 -1.0
10000 3.0 3 -0.0
10000 4.0 4 -0.0
10000 5.0 5 -0.0
10000 6.0 5 -1.0
10000 7.0 8 -1.0
10000 8.0 8 -0.0
10000 9.0 8 -1.0

So 6 out of 10 are correct and the other 4 are off by one.

I must admit I didn’t try the usual tricks beyond the learning rate to make it work better (reward scaling seems to be one of them), but I know little about RL theory at the moment and even less about practice, except that Alex Irpan says it’s hard.
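
As one example of such a trick, here is a rough sketch of baseline subtraction on the same toy task. It is not the code from this thread and not the reward scaling mentioned above: a table of logits per input stands in for the network, and a running mean reward per input serves as the baseline. All names (`train`, `greedy_errors`) and hyperparameters are assumptions chosen for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(num_values=10, steps=20000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # One row of logits per input: a tabular stand-in for the policy network.
    logits = [[0.0] * num_values for _ in range(num_values)]
    baseline = [0.0] * num_values  # running mean reward per input
    for _ in range(steps):
        x = rng.randrange(num_values)
        probs = softmax(logits[x])
        a = rng.choices(range(num_values), weights=probs)[0]
        reward = -abs(a - x)
        advantage = reward - baseline[x]
        # For a softmax policy, d log pi(a|x) / d logit_k = 1[k == a] - probs[k].
        for k in range(num_values):
            grad = (1.0 if k == a else 0.0) - probs[k]
            logits[x][k] += lr * advantage * grad  # gradient ascent on expected reward
        baseline[x] += 0.05 * (reward - baseline[x])
    return logits

def greedy_errors(logits):
    """Distance between each input and the most probable action for that input."""
    return [abs(max(range(len(row)), key=row.__getitem__) - x)
            for x, row in enumerate(logits)]

print(greedy_errors(train()))
```

Subtracting the baseline means an action is reinforced only when it does better than what the policy currently achieves on that input, so the always-negative reward no longer pushes every sampled action down.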

Best regards



For the wrong ones I mostly get a reward of -1, but that is not what bothers me. What bothers me is that the probability of the correct answer is very low compared to that of the selected one. Is that normal?
In addition, when I tried the same thing with 100 inputs and outputs, the probabilities came out the same for all inputs, and as a result I get the same output for all of them.