Policy Gradient using loss as reward

Given an input, I want to predict the right output using policy gradient. The input is a number between 0 and 9 and the output is also a number between 0 and 9. I defined the reward as -abs(input - output), which is essentially a negative loss, and with gradient ascent I expected the predicted number to converge to the input, but unfortunately it doesn’t.
I’m using the following REINFORCE implementation in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.nn.utils as utils
from torch.autograd import Variable
import numpy as np


class Policy(nn.Module):
    def __init__(self, hidden_size, num_inputs, action_space):
        super(Policy, self).__init__()
        self.action_space = action_space
        num_outputs = action_space

        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, num_outputs)

    def forward(self, inputs):
        x = inputs
        x = F.relu(self.linear1(x))
        action_scores = self.linear2(x)
        return F.softmax(action_scores, dim=1)


class REINFORCE:
    def __init__(self, hidden_size, num_inputs, action_space, gamma=0.99):
        torch.manual_seed(0)
        np.random.seed(0)
        self.action_space = action_space
        self.model = Policy(hidden_size, num_inputs, action_space)
        self.model = self.model.cuda()
        self.optimizer = optim.Adam(self.model.parameters(), lr=1e-3)
        self.model.train()
        self.gamma = gamma
        self.rewards = []
        self.log_probs = []
        self.entropies = []

    def select_action(self, state):
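        # sample an action from the softmax distribution over the 10 outputs and
        # return it together with its log-probability and the distribution's entropy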
        probs = self.model(Variable(state).cuda())
        action = probs.multinomial(1).data.unsqueeze(0)
        prob = probs[:, action[0, 0]].view(1, -1)
        log_prob = prob.log()
        entropy = - (probs * probs.log()).sum()
        return action[0], log_prob, entropy

    def update_parameters(self):
        R = torch.zeros(1, 1)
        loss = 0
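        # walk backwards through the collected steps, accumulating the discounted
        # return R and summing the REINFORCE loss: -log_prob * R minus a small entropy bonus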
        for i in reversed(range(len(self.rewards))):
            R = self.gamma * R + self.rewards[i]
            loss = loss - (self.log_probs[i] * (Variable(R).expand_as(self.log_probs[i])).cuda()).sum() - (
                        0.0001 * self.entropies[i].cuda()).sum()
        loss = loss / len(self.rewards)

        self.optimizer.zero_grad()
        loss.backward()
        utils.clip_grad_norm(self.model.parameters(), 40)
        self.optimizer.step()
        self.rewards = []
        self.log_probs = []
        self.entropies = []

    def add_experience(self, reward, log_prob, entropy):
        self.rewards.append(reward)
        self.log_probs.append(log_prob)
        self.entropies.append(entropy)


if __name__ == '__main__':
    reinforce = REINFORCE(128,1,10)
    while True:
        for i in range(10):
            input = torch.FloatTensor([i]).unsqueeze(0)
            output, log_prob, entropy = reinforce.select_action(input)
            reward = -torch.abs(output.type(torch.FloatTensor) - input)
            reinforce.add_experience(reward, log_prob, entropy)
            reinforce.update_parameters()

What am I doing wrong?
Thanks :)

So basically the network should learn the identity, i.e. (input * 1) = output?

Exactly. Of course, eventually I want to do something more involved; I’m using this example just to see whether it works (and it doesn’t).

Hi,

Three things I noticed (though I’m not an expert in RL):

  • I would expect update_parameters to be called less often. Should it be moved outside the for loop? Policy gradient is notorious for high variance, so deriving one update from multiple trials might work better.
  • I would clamp the probabilities away from zero before taking logs. I needed this to avoid the device asserts.
  • Are you sure the “R” formula does what you want here? For a bandit-type prediction, where each step is its own one-step episode, accumulating a discounted return across steps seems somewhat odd; see the small illustration after this list.
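To make the last point concrete: with the discounted accumulation, the return credited to one input also picks up the rewards of later, completely unrelated inputs. A toy illustration with made-up reward values (plain Python, independent of the network):

# hypothetical per-step rewards for three independent one-step "episodes"
rewards = [-3.0, 0.0, -2.0]
gamma = 0.99

R = 0.0
returns = []
for r in reversed(rewards):
    R = gamma * R + r        # what update_parameters currently accumulates
    returns.insert(0, R)

print(returns)  # approx. [-4.96, -1.98, -2.0] -> step 0 is also "blamed" for steps 1 and 2
print(rewards)  # [-3.0, 0.0, -2.0]            -> what a bandit-style update would use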

Changing these, the network appears to learn something (being mostly right or off only by 1) although the results won’t be perfect even after many iterations.

Best regards

Thomas

Thank you!
I tried changing the code according to your notes, but I only get the correct answer 50%–60% of the time. Is that what you’re getting? If not, could you post the code after the changes?
By the way, you’re right, there’s no need for the R formula in this case.

Do you also get closer for the wrong ones? The reward implies that being off by one is better than being further away.

So here are the bits I changed:

  • The learning rate in the optimizer instantiation (I did not play much with it)
        self.optimizer = optim.Adam(self.model.parameters(), lr=3e-3)
  • in select_action, clamping the probabilities before taking the log:
        log_prob = prob.clamp(min=1e-6).log()
        entropy = - (probs * probs.clamp(min=1e-6).log()).sum()
  • in update_parameters, the “multiple independent episodes” reward (ideally one would allow several multi-step episodes, but here every step is independent); the full method with this change is sketched after this list
            R = self.rewards[i]
  • and finally I changed the inner loop to update the parameters only after 50 samples (note the % 10 to keep the inputs in 0…9), put in 10001 outer loops, and added a diagnostic print every 500 outer loops
    reinforce = REINFORCE(128,1,10)
    for j in range(10001):
        for i in range(50):
            input = torch.FloatTensor([i%10]).unsqueeze(0)
            output, log_prob, entropy = reinforce.select_action(input)
            reward = -torch.abs(output.type(torch.FloatTensor) - input)
            if j%500==0 and i < 10:
                print (j, input[0,0], output[0,0], reward[0,0])
            reinforce.add_experience(reward, log_prob, entropy)
        reinforce.update_parameters()
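Putting the reward change into the method from the original post, update_parameters would look roughly like this (a sketch; everything after the loop is unchanged):

    def update_parameters(self):
        loss = 0
        for i in range(len(self.rewards)):
            # bandit-style: use each step's reward directly instead of a discounted return
            R = self.rewards[i]
            loss = loss - (self.log_probs[i] * Variable(R).expand_as(self.log_probs[i]).cuda()).sum() \
                   - (0.0001 * self.entropies[i].cuda()).sum()
        loss = loss / len(self.rewards)

        self.optimizer.zero_grad()
        loss.backward()
        utils.clip_grad_norm(self.model.parameters(), 40)
        self.optimizer.step()
        self.rewards = []
        self.log_probs = []
        self.entropies = []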


With those changes, the diagnostic print goes from

0 0.0 0 -0.0
0 1.0 4 -3.0
0 2.0 8 -6.0
0 3.0 4 -1.0
0 4.0 4 -0.0
0 5.0 4 -1.0
0 6.0 8 -2.0
0 7.0 0 -7.0
0 8.0 4 -4.0
0 9.0 8 -1.0

to

10000 0.0 0 -0.0
10000 1.0 1 -0.0
10000 2.0 1 -1.0
10000 3.0 3 -0.0
10000 4.0 4 -0.0
10000 5.0 5 -0.0
10000 6.0 5 -1.0
10000 7.0 8 -1.0
10000 8.0 8 -0.0
10000 9.0 8 -1.0

so 60/40 correct and one off.

I must admit I didn’t try the usual tricks beyond the learning rate to make it work better (reward scaling/normalisation seems to be one, but I know little about RL theory at the moment and even less about practice, except that Alex Irpan says it’s hard).
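If someone wants to try, normalising the rewards inside update_parameters might look roughly like this (a sketch only, not tested on this example; the 1e-6 just avoids division by zero when all rewards are equal):

        # sketch (untested): normalise the batch of rewards as a crude baseline
        rewards = torch.cat([r.view(1) for r in self.rewards])
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        loss = 0
        for i in range(len(self.rewards)):
            R = rewards[i:i + 1].view(1, 1)
            loss = loss - (self.log_probs[i] * Variable(R).expand_as(self.log_probs[i]).cuda()).sum() \
                   - (0.0001 * self.entropies[i].cuda()).sum()
        loss = loss / len(self.rewards)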

Best regards

Thomas


I get -1 mostly for the wrong ones, but that’s not what bothers me. What bothers me is that the probability of the correct answer is very low compared to the probability of the action that gets selected. Is that normal?
In addition, now that I’ve tried the same thing with 100 inputs and outputs, for some reason the probabilities are the same for all inputs, and as a result I get the same output for all of them.
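(For anyone who wants to reproduce this, the distributions can be inspected with a quick check like the following, where reinforce is the trained REINFORCE instance from above.)

    # quick diagnostic: print the policy's output distribution for each input
    for i in range(10):  # range(100) for the larger version
        state = torch.FloatTensor([i]).unsqueeze(0)
        probs = reinforce.model(Variable(state).cuda())
        print(i, probs.data.cpu().numpy().round(3))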