Hi,

I was trying to implement the REINFORCE algorithm from scratch, but somehow the policy was not improving at all. I will skip the details and post only the part of the code that I suspect is the culprit (mine vs. the code from the PyTorch examples). Can someone please tell me what the difference is between the two? The seeds and the network are the same.

**Code That Does Not Work**

```
class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        self.network = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax())
        self.log_probs = []
        self.rewards = []

    def forward(self, state):
        temp_state = torch.from_numpy(state).float().unsqueeze(0)
        return self.network(temp_state)

    def train(self):
        exp_return = 0
        returns = []
        #policy_loss = []
        for reward in self.rewards[::-1]:
            exp_return= reward + 0.99*exp_return
            returns.insert(0, exp_return)
        returns = torch.tensor(returns)
        policy_loss = Variable((returns * torch.tensor(self.log_probs)).sum(), requires_grad=True)
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()
        self.rewards = []
        self.log_probs = []

    def policy_add_reward(self, reward):
        self.rewards.append(reward)

    def select_action(self, state):
        probs = self(state)
        m = Categorical(probs)
        action = m.sample()
        self.log_probs.append(-m.log_prob(action))
        return action.item()
```

**Code that Does Work**

That can be found here

I attached a debugger and inspected the policy_loss for both versions. Since the seed was the same, the loss values were consistent.

I'm not familiar with REINFORCE, but from the code snippet it looks like you shouldn't re-wrap your `policy_loss` into a new `Variable`, as this will detach your computation graph.

Could you just skip the `Variable` creation and just perform the `mul` and `sum` operations?

Not related to the problem, but in the current stable release, `Variables` and `tensors` were merged, alongside a lot of other bug fixes and new features. Have a look at the website for the install instructions.

I have tried that as well; it does not work either. The code iterates very fast, though. I think something is disconnected from the computation graph. Is there a way to check that?
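One way to check this, as a minimal sketch: look at `grad_fn` and `is_leaf` on the loss, and at `.grad` on the parameters after `backward()`. The tiny `nn.Linear(4, 2)` network and the all-ones state below are hypothetical stand-ins, not the model from the question.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(4, 2)  # hypothetical stand-in for the policy network
log_prob = torch.log_softmax(net(torch.ones(4)), dim=0)[0]

# Loss built directly from network outputs: still attached to the graph.
attached_loss = -log_prob
# Loss rebuilt from its plain value: a fresh leaf with no history.
rewrapped_loss = torch.tensor(attached_loss.item(), requires_grad=True)

def describe(name, loss):
    # grad_fn is None for a leaf tensor, so backward() stops right there.
    print(name, "is_leaf:", loss.is_leaf, "grad_fn:", loss.grad_fn)

describe("attached", attached_loss)
describe("rewrapped", rewrapped_loss)

# backward() on the re-wrapped loss runs without error but never reaches the network.
rewrapped_loss.backward()
grads_after_rewrapped = [p.grad for p in net.parameters()]

# backward() on the attached loss fills in the parameter gradients.
attached_loss.backward()
grads_after_attached = [p.grad for p in net.parameters()]

print("grads after rewrapped loss:", grads_after_rewrapped)
print("grads present after attached loss:", [g is not None for g in grads_after_attached])
```

A loss with `grad_fn` equal to `None` has nothing to backpropagate into, which would also explain why an update step finishes very quickly.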

In the first part you create a tensor from the `log_probs`. I think at this point you create a tensor with an empty history, since your `returns` have no history either. In the second part you have (probably) saved the log_probs externally as tensors (including the gradient path). This is why you do not have a very deep gradient history in your first snippet, I suppose.
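To illustrate the point about the empty history, here is a small sketch. The learnable `logits` vector, the fixed actions and the return values are made-up stand-ins for a real policy and episode. It builds the same loss twice: once from plain values, once with `torch.stack`, which keeps the log-probs attached.

```python
import torch

# Hypothetical stand-in for a policy: a single learnable logit vector.
logits = torch.zeros(2, requires_grad=True)
actions = (0, 1, 0)  # made-up actions
log_probs = [torch.log_softmax(logits, dim=0)[a] for a in actions]
returns = torch.tensor([3.0, 2.0, 1.0])  # made-up discounted returns

# Rebuilding a tensor from plain numbers copies the values but drops the history.
detached = torch.tensor([lp.item() for lp in log_probs])
loss_detached = (returns * -detached).sum()

# torch.stack keeps every log-prob connected to `logits`.
stacked = torch.stack(log_probs)
loss_attached = (returns * -stacked).sum()

print(loss_detached.item(), loss_detached.requires_grad)
print(loss_attached.item(), loss_attached.requires_grad)

loss_attached.backward()
print(logits.grad)
```

Both losses have the same numerical value, which is why comparing the loss in a debugger does not reveal the problem; only the stacked version can pass gradients back to `logits`.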

Thank you for the reply. Both codes are exactly the same outside this function.

Then this could be the problem. The idea of reinforcement learning is that you use the gradient path of your predictions (in your case the `log_probs`) to propagate the reward which was multiplied with the predictions. If you don't have the gradient path for the predictions (as seems to be the case in your code snippets), you cannot successfully propagate gradients through the network.

Have you monitored the model's parameters using plain SGD for optimization? Do they change?
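A rough sketch of how the parameters could be monitored: take a copy before the update step and compare afterwards. The `nn.Linear(4, 2)` network, the all-ones input and the learning rate are placeholders chosen for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(4, 2)  # hypothetical stand-in for the policy network
optimizer = torch.optim.SGD(net.parameters(), lr=1e-1)

# Snapshot of the parameters before the update.
before = [p.detach().clone() for p in net.parameters()]

# A loss that is connected to the network outputs.
loss = -torch.log_softmax(net(torch.ones(4)), dim=0)[0]
optimizer.zero_grad()
loss.backward()
optimizer.step()

# With plain SGD, a parameter changes only if it received a non-zero gradient.
changed = [not torch.equal(b, p.detach()) for b, p in zip(before, net.parameters())]
for (name, _), flag in zip(net.named_parameters(), changed):
    print(name, "changed:", flag)
```

If every entry stays unchanged after a step with plain SGD, the gradients are most likely not reaching the parameters.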

I am new to PyTorch, so I am not sure how to do that.

Can you maybe post a bit more code or link a repository, so that we could have a look at your whole model?

yeah sure. Give me 5 mins

Hi, I have edited my code in the question. The correct code is also added in the edit. Thank you

So the gradients are `None`.

The fact that there aren't any grads is a strong hint towards my suggestion above. If I have some time tomorrow, I'll try to get your code working.


The following runs with PyTorch 0.4 (note that Variables are no longer needed, since they have been merged with tensors in 0.4). For lower versions you have to make some minor changes.

```
import torch
torch.manual_seed(42)
from torch import nn
from torch.distributions import Categorical
import gym


class Policy(nn.Module):
    """model definition: Simple Network with Linear Layers and 2 Outputs"""
    def __init__(self):
        super(Policy, self).__init__()
        # actual network
        self.network = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax(dim=0))
        # lists to store log_probs and the corresponding rewards
        self.saved_log_probs = []
        self.rewards = []

    def forward(self, state):
        # propagate states through network (PyTorch automatically saves their gradient path and intermediate results)
        return self.network(state)


def select_action(model, state):
    """
    Function to select the actual action upon a model's decision
    :param model: model which predicts next action
    :param state: current state (on which to react)
    """
    probs = model(state)
    m = Categorical(probs)
    action = m.sample()
    # save log_probs as tensor
    model.saved_log_probs.append(-m.log_prob(action))
    return action.item()


def update_model(model: Policy, optimizer):
    """
    Function to update the model's parameters by the gradients of the saved results (rewards and saved log_probs)
    :param model:
    :param optimizer:
    :return:
    """
    exp_return = 0
    returns = []
    policy_loss = []
    # calculate rewards
    for reward in model.rewards[::-1]:
        exp_return = reward + 0.99 * exp_return
        returns.insert(0, exp_return)
    # multiply rewards with saved log_probs (tensors) to use their gradient path
    for idx, _reward in enumerate(returns):
        policy_loss.append(_reward*model.saved_log_probs[idx])
    # add batch dimension, concatenate the list entries and sum them up for a total loss
    summed_policy_loss = torch.cat([tmp.unsqueeze(0) for tmp in policy_loss]).sum()
    # actual weight update
    optimizer.zero_grad()
    summed_policy_loss.backward()
    optimizer.step()
    # empty saved rewards and saved log_probs
    model.saved_log_probs, model.rewards = [], []


def train(render=False):
    """
    Major train routine
    :param render: whether or not to render the environment
    """
    # create environment
    env = gym.make('CartPole-v0')
    env.seed(42)
    # create device (run on GPU if possible)
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    # create optimizer and model (and push model to according device)
    policy = Policy().to(device)
    optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)
    # iterate through episodes (specify max_episodes here)
    episode = 1
    max_episodes = None
    while episode:
        # get current state from environment
        state = env.reset()
        # play a sequence (maximum of 10000 actions)
        for t in range(10000):
            # create tensor from state and push it to same device as model
            state_tensor = torch.from_numpy(state).to(torch.float).to(device)
            # select the action for each state
            action = select_action(policy, state_tensor)
            # execute action, get reward, new state and whether the sequence can be continued
            # (whether pole did not topple down)
            state, reward, done, _ = env.step(action)
            # render the environment if necessary
            if render:
                env.render()
            # save reward for current state
            policy.rewards.append(reward)
            # breaking condition (break if pole toppled over)
            if done:
                break
        # update model by previous rewards and log_probs (saved in model)
        update_model(policy, optimizer)
        # optional: print weights of network's first layer to see if parameters changed
        # (if they change the gradient path is okay)
        # print(policy.network[0].weight)
        # breaking condition for number of episodes
        if max_episodes is not None and episode >= max_episodes:
            break
        # move to next episode
        episode += 1


if __name__ == '__main__':
    train(True)
```

EDIT: Just noticed that the code is pretty similar to the one which is given as example in the pytorch repo. But I hope the explanations are helpful.
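The return calculation in `update_model` can also be checked on its own. The sketch below uses a hypothetical helper name, `discounted_returns`, and a `gamma` of 0.5 only so the numbers are easy to verify by hand; the loop itself mirrors the one in the code above.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    running = 0.0
    returns = []
    # Walk backwards so each step can reuse the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    # Restore chronological order.
    returns.reverse()
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Appending and reversing once at the end gives the same list as calling `insert(0, ...)` on every step.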


Thank you so much for taking the time to write all this code; I appreciate it a lot. The question I still have is why my code doesn't calculate the gradients. Why does the code below work while the way I wrote it didn't, especially when the policy loss is the same for both of them? Thank you.

```
# multiply rewards with saved log_probs (tensors) to use their gradient path
for idx, _reward in enumerate(returns):
    policy_loss.append(_reward*model.saved_log_probs[idx])
```

The difference is that you only saved the data and not the tensors themselves. However, the gradient path is stored in the tensor object, so saving the data and creating a tensor from it again is not sufficient: the gradient path is lost.


ohhhhh. I got it. Thank you

So, I tried this code. The problem I am facing is that when I run it on the GPU, it does not train at all. Can you please tell me why it behaves that way? I tried it on the CPU (with, of course, a different optimizer) and the same seeds, and it gives me the same answers, which is good, but on the GPU it does not work at all. Is there something I am missing? Shouldn't the result be the same, since the seed is fixed?

So you switched the optimizer between the CPU and the GPU version? Results with different optimizers are not exactly comparable. You should also note that some GPU (cuDNN) operations are non-deterministic by default. You may switch that setting with

```
torch.backends.cudnn.deterministic = True
```

Set this after importing and seeding PyTorch (this will slow down your code a bit). Can you also try it with plain SGD (and exactly the same parameters) on GPU and CPU and post the results?
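Put together, a seeding setup could look roughly like the sketch below. The helper name `seed_everything` is made up for illustration, and disabling `cudnn.benchmark` is an extra assumption on my side that goes beyond the single flag above. Even with these settings, bitwise identical results between CPU and GPU are not guaranteed.

```python
import random

import torch


def seed_everything(seed=42):
    """Seed the Python and PyTorch RNGs and ask cuDNN for deterministic algorithms."""
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels (may be slower).
    torch.backends.cudnn.deterministic = True
    # Assumption: also turn off the autotuner, which can select different kernels per run.
    torch.backends.cudnn.benchmark = False


seed_everything(42)
a = torch.rand(3)
seed_everything(42)
b = torch.rand(3)
print(torch.equal(a, b))
```

Re-seeding before each run makes the random draws on one device repeatable, so two runs on the same device can be compared directly.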


I have used the same optimizer for both of the models. Let me try what you have suggested. Again, thank you so much for your effort. I appreciate it a lot.