Difference between the 2 codes below (REINFORCE)


I was trying to implement the REINFORCE algorithm from scratch, but the policy was not improving at all. I will skip the details and post only the culprit part of the code (mine vs. the code from the PyTorch examples). Can someone please tell me what the difference is between the two versions below? The seeds and the network are the same.

Code That Does Not Work

class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        self.network = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax())
        self.log_probs = []
        self.rewards = []

    def forward(self, state):
        temp_state = torch.from_numpy(state).float().unsqueeze(0)
        return self.network(temp_state)

    def train(self):
        exp_return = 0
        returns = []
        #policy_loss = []
        for reward in self.rewards[::-1]:
            exp_return= reward + 0.99*exp_return
            returns.insert(0, exp_return)
        returns = torch.tensor(returns)
        policy_loss = Variable((returns * torch.tensor(self.log_probs)).sum(), requires_grad=True)
        self.rewards = []
        self.log_probs = []

    def policy_add_reward(self, reward):
        self.rewards.append(reward)

    def select_action(self, state):
        probs = self(state)
        m = Categorical(probs)
        action = m.sample()
        return action.item()

Code that Does Work
That can be found here

I have stepped through both versions with a debugger and inspected policy_loss. Since the seed was the same, the loss was the same in both.

I’m not familiar with REINFORCE, but from the code snippet it looks like you shouldn’t re-wrap your policy_loss into a new Variable, as this will detach your computation graph.
Could you just skip the Variable creation and just perform the mul and sum operation?
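To make the suggestion above concrete, here is a minimal sketch (not the original poster's code) of what re-wrapping a loss does. It uses `detach().requires_grad_()` as a stand-in for `Variable(..., requires_grad=True)`, on the assumption that both produce a fresh leaf tensor with no history. The `nn.Linear` layer and the random input are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)          # hypothetical stand-in for the policy network
x = torch.randn(1, 4)

loss = layer(x).sum()            # still attached to the graph of `layer`

# Re-wrapping yields a new leaf tensor with no history (assumed equivalent
# to wrapping the loss in a new Variable).
rewrapped = loss.detach().requires_grad_()
rewrapped.backward()             # only fills rewrapped.grad
grad_after_rewrapped = layer.weight.grad

loss.backward()                  # backpropagates into the layer
grad_after_direct = layer.weight.grad

print(grad_after_rewrapped)            # None
print(grad_after_direct is not None)   # True
```

Calling `backward()` on the re-wrapped loss runs without an error, which may be why the problem is easy to miss: the gradient simply never reaches the network.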

Not related to the problem, but in the current stable release, Variables and tensors were merged besides a lot of other bug fixes and new features. Have a look at the website for the install instructions.

I have tried that as well, and it does not work either. The code iterates very fast, though. I think something is disconnected from the computation graph. Is there a way to check that?

In the first part you create a tensor of the log_probs. I think at this point you create a tensor with empty history since your returns have no history as well. In the second part you have (probably) saved the log_probs externally as tensor (including the gradient path). This is why you do not have a very deep gradient history in your first snippet, I suppose.
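One possible way to check whether a tensor is still connected to the graph is to look at its `requires_grad` flag and its `grad_fn` attribute. The sketch below is only an illustration: `is_connected` is a hypothetical helper, and the `logits` tensor stands in for the network output.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)   # hypothetical stand-in for the network output
dist = Categorical(logits=logits)

log_prob = dist.log_prob(dist.sample())       # tensor with a gradient path
rebuilt = torch.tensor([log_prob.item()])     # new tensor built from the plain value


def is_connected(t):
    """True if autograd can backpropagate through `t` (hypothetical helper)."""
    return t.requires_grad and t.grad_fn is not None


print(is_connected(log_prob))  # True
print(is_connected(rebuilt))   # False
```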

Thank you for the reply. Both codes are exactly the same outside this function.

Then this could be a problem. The idea of reinforcement learning is that you use the gradient path of your predictions (in your case the log_probs) to propagate the reward, which is multiplied with the predictions. If you don’t have the gradient path for the predictions (as seems to be the case in your code snippets), you cannot propagate gradients through the network.

Have you monitored the model’s parameters using plain SGD for optimization? Do they change?

I am new to PyTorch and am not aware of how to do that.
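One way to do that check, as a minimal sketch: copy the parameters before an optimizer step and compare them afterwards. The tiny `nn.Linear` model and the squared-output loss are placeholders chosen for illustration, not part of the code in this thread.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                       # hypothetical stand-in for the policy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# copy the parameters before the update
before = [p.detach().clone() for p in model.parameters()]

loss = model(torch.randn(8, 4)).pow(2).sum()  # arbitrary loss, for illustration only
optimizer.zero_grad()
loss.backward()
optimizer.step()

# one flag per parameter tensor: did it change, and did it receive a gradient?
changed = [not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())]
has_grad = [p.grad is not None for p in model.parameters()]
print(changed)
print(has_grad)
```

If the loss is detached from the network, `has_grad` should contain only `False` and the parameters should stay unchanged after the step.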

Can you maybe post a bit more code or link a repository, so that we could have a look at your whole model?

yeah sure. Give me 5 mins

Hi, I have edited my code in the question. The correct code is also added in the edit. Thank you

so the gradients are None

The fact that there aren’t any grads is a strong hint towards my suggestion above. If I have some time tomorrow, I’ll try to get your code working


The following runs with PyTorch 0.4 (note that Variables are no longer needed, since they were merged with tensors in 0.4). For lower versions you have to make some minor changes.

import torch
from torch import nn
from torch.distributions import Categorical
import gym

class Policy(nn.Module):
    """model definition: Simple Network with Linear Layers and 2 Outputs"""
    def __init__(self):
        super(Policy, self).__init__()
        # actual network
        self.network = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax(dim=0))

        # lists to store log_probs and the corresponding rewards
        self.saved_log_probs = []
        self.rewards = []

    def forward(self, state):
        # propagate states through network (PyTorch automatically saves their gradient path and intermediate results)
        return self.network(state)

def select_action(model, state):
    """
    Function to select the actual action upon a model's decision
    :param model: model which predicts next action
    :param state: current state (on which to react)
    """
    probs = model(state)
    m = Categorical(probs)
    action = m.sample()
    # save log_probs as tensor
    model.saved_log_probs.append(m.log_prob(action))
    return action.item()

def update_model(model: Policy, optimizer):
    """
    Function to update the model's parameters by the gradients of the saved results (rewards and saved log_probs)
    :param model:
    :param optimizer:
    """
    exp_return = 0
    returns = []
    policy_loss = []

    # calculate rewards
    for reward in model.rewards[::-1]:
        exp_return = reward + 0.99 * exp_return
        returns.insert(0, exp_return)

    # multiply rewards with saved log_probs (tensors) to use their gradient path
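    for idx, _reward in enumerate(returns):
        # negative sign: the optimizer minimizes, but the return should be maximized
        policy_loss.append(-model.saved_log_probs[idx] * _reward)
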
    # add batch dimension, concatenate the list entries and sum them up for a total loss
    summed_policy_loss = torch.cat([tmp.unsqueeze(0) for tmp in policy_loss]).sum()

    # actual weight update
    optimizer.zero_grad()
    summed_policy_loss.backward()
    optimizer.step()

    # empty saved rewards and saved log_probs
    model.saved_log_probs, model.rewards = [], []

def train(render=False):
    """
    Major train routine
    :param render: whether or not to render the environment
    """

    # create environment
    env = gym.make('CartPole-v0')

    # create device (run on GPU if possible)
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    # create optimizer and model (and push model to according device)
    policy = Policy().to(device)
    optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

    # iterate through episodes (specify max_episodes here)
    episode = 1
    max_episodes = None

    while episode:
        # get current state from environment
        state = env.reset()

        # play a sequence (maximum of 10000 actions)
        for t in range(10000):
            # create tensor from state and push it to same device as model
            state_tensor = torch.from_numpy(state).to(torch.float).to(device)

            # select the action for each state
            action = select_action(policy, state_tensor)

            # execute action, get reward, new state and whether the sequence can be continued
            # (whether pole did not topple down)
            state, reward, done, _ = env.step(action)

            # render the environment if necessary
            if render:
                env.render()

            # save reward for current state
            policy.rewards.append(reward)

            # breaking condition (break if pole toppled over)
            if done:
                break

        # update model by previous rewards and log_probs (saved in model)
        update_model(policy, optimizer)

        # optional: print weights of network's first layer to see if parameters changed
        # (if they change the gradient path is okay)
        # print(policy.network[0].weight)

        # breaking condition for number of episodes
        if max_episodes is not None and episode >= max_episodes:
            break

        # move to next episode
        episode += 1

if __name__ == '__main__':
    train()

EDIT: Just noticed that the code is pretty similar to the one which is given as example in the pytorch repo. But I hope the explanations are helpful.
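The return computation in update_model can also be checked in isolation. The sketch below is a plain-Python version of that loop; the function name `discounted_returns` is hypothetical, and it appends and reverses instead of calling `insert(0, ...)`, which should give the same values.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    running = 0.0
    returns = []
    # walk the episode backwards so each step can reuse the return of its successor
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


print(discounted_returns([1.0, 1.0, 1.0]))  # approximately [2.9701, 1.99, 1.0]
```

These returns are plain Python floats and carry no gradient history, which is fine: only the log_probs they are multiplied with need to keep their gradient path.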


Thank you so much for taking the time to write all this code; I appreciate it a lot. The question I still have is why my code doesn’t calculate the gradients. Why does the code below work while the way I wrote it doesn’t, especially when the policy loss is the same for both of them? Thank you.

# multiply rewards with saved log_probs (tensors) to use their gradient path
for idx, _reward in enumerate(returns):
    policy_loss.append(-model.saved_log_probs[idx] * _reward)

The difference is that you only saved the data and not the tensors themselves. The gradient path, however, is stored in the tensor object, so saving only the data and creating a new tensor from it is not sufficient: the gradient path is lost.
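This also explains why both versions can show the same policy loss value. The following is a minimal sketch under made-up inputs (the `logits` tensor stands in for the network output and the `returns` values are invented): the two losses agree numerically, but only the one built from the saved tensors can be backpropagated.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)   # hypothetical stand-in for the network output
dist = Categorical(logits=logits)
returns = torch.tensor([2.9701, 1.99, 1.0])   # made-up discounted returns

saved_tensors, saved_values = [], []
for _ in range(3):
    log_prob = dist.log_prob(dist.sample())
    saved_tensors.append(log_prob)            # keeps the gradient path
    saved_values.append(log_prob.item())      # plain Python float, no history

loss_from_tensors = -(returns * torch.stack(saved_tensors)).sum()
loss_from_values = -(returns * torch.tensor(saved_values)).sum()

print(torch.isclose(loss_from_tensors, loss_from_values).item())  # True
print(loss_from_tensors.requires_grad)  # True
print(loss_from_values.requires_grad)   # False

loss_from_tensors.backward()
print(logits.grad is not None)  # True
```

Calling `backward()` on `loss_from_values` would raise an error because the tensor does not require gradients; re-wrapping it with `requires_grad=True` silences the error but does not restore the path to the network.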


ohhhhh. I got it. Thank you

So, I tried this code. The problem I am facing is that when I run it on the GPU, it does not train at all. Can you please tell me why it behaves that way? I tried it on the CPU (with, of course, a different optimizer) and with the same seeds, and it gives me the same answers, which is good, but on the GPU it does not work at all. Is there something I am missing? Shouldn’t the result be the same, since the seed is fixed?

So you switched the optimizer between the CPU and the GPU version? Results with different optimizers are not exactly comparable. You should also note that the GPU is non-deterministic by default. You may switch that setting with

torch.backends.cudnn.deterministic = True

after importing and seeding PyTorch (this will slow down your code a bit). Can you also try it with plain SGD (and exactly the same parameters) on GPU and CPU and post the results?
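As a rough sketch of the seeding step mentioned above: the helper name `seed_everything` is hypothetical, and setting `torch.backends.cudnn.benchmark = False` is an extra assumption on top of the flag given in this reply. Note that this only covers PyTorch's own random number generators; the Gym environment keeps its own seed.

```python
import torch


def seed_everything(seed):
    """Seed PyTorch and ask cuDNN for deterministic algorithms (hypothetical helper)."""
    torch.manual_seed(seed)                     # seeds PyTorch's random number generators
    torch.backends.cudnn.deterministic = True   # the flag suggested above
    torch.backends.cudnn.benchmark = False      # assumption: also disable the cuDNN autotuner


seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

Even with these settings, results on CPU and GPU are not guaranteed to match bit for bit, so comparing the trend of the loss may be more informative than comparing exact values.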


I have used the same optimizer for both of the models. Let me try what you have suggested. Again, thank you so much for your effort. I appreciate it a lot.