I have tried that as well, but it does not work either. The code iterates very fast, though. I think something is disconnected from the computation graph. Is there a way to check that?
In the first part you create a tensor of the `log_probs`. I think at this point you create a tensor with an empty history, since your returns have no history either. In the second part you have (probably) saved the `log_probs` externally as tensors (including the gradient path). This is why you do not have a very deep gradient history in your first snippet, I suppose.
Thank you for the reply. Both code snippets are exactly the same outside this function.
Then this could be the problem. The idea of reinforcement learning is that you use the gradient path of your predictions (in your case the `log_probs`) to propagate the reward which was multiplied with the predictions. If you don't have the gradient path for the predictions (as it seems to be the case in your code snippets), you cannot successfully propagate gradients through the network.
Have you monitored the model’s parameters using plain SGD for optimization? Do they change?
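For example, you could compare snapshots of the parameters before and after an optimizer step, and check that the grads are not `None`. A minimal sketch with a toy stand-in network (not your actual Policy):

```python
import torch
from torch import nn

torch.manual_seed(0)

# toy stand-in for the policy network (hypothetical, just for illustration)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# snapshot the parameters before the update
before = [p.detach().clone() for p in model.parameters()]

# one dummy forward/backward pass
loss = model(torch.randn(4)).sum()
optimizer.zero_grad()
loss.backward()

# if the gradient path is intact, every p.grad is a tensor, not None
for name, p in model.named_parameters():
    print(name, p.grad is not None)

optimizer.step()

# after the step the parameters should differ from the snapshot
changed = any(not torch.equal(b, p.detach())
              for b, p in zip(before, model.parameters()))
print("parameters changed:", changed)
```

If `p.grad` is `None` for every parameter after `backward()`, the loss is disconnected from the network.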
I am new to pytorch, I am not aware of how to do that.
Can you maybe post a bit more code or link a repository, so that we could have a look at your whole model?
yeah sure. Give me 5 mins
Hi, I have edited my code in the question. The correct code is also added in the edit. Thank you
so the gradients are None
The fact that there aren't any grads is a strong hint towards my suggestion above. If I have some time tomorrow, I'll try to get your code working.
The following runs with PyTorch 0.4 (note that Variables are no longer needed, since they have been merged with tensors in 0.4). For lower versions you have to make some minor changes.
```python
import torch
torch.manual_seed(42)
from torch import nn
from torch.distributions import Categorical
import gym


class Policy(nn.Module):
    """model definition: Simple Network with Linear Layers and 2 Outputs"""

    def __init__(self):
        super(Policy, self).__init__()
        # actual network
        self.network = nn.Sequential(nn.Linear(4, 128),
                                     nn.ReLU(),
                                     nn.Linear(128, 2),
                                     nn.Softmax(dim=0))
        # lists to store log_probs and the corresponding rewards
        self.saved_log_probs = []
        self.rewards = []

    def forward(self, state):
        # propagate states through network (PyTorch automatically saves
        # their gradient path and intermediate results)
        return self.network(state)


def select_action(model, state):
    """
    Function to select the actual action upon a model's decision
    :param model: model which predicts next action
    :param state: current state (on which to react)
    """
    probs = model(state)
    m = Categorical(probs)
    action = m.sample()
    # save log_probs as tensor
    model.saved_log_probs.append(-m.log_prob(action))
    return action.item()


def update_model(model: Policy, optimizer):
    """
    Function to update the model's parameters by the gradients of the
    saved results (rewards and saved log_probs)
    :param model:
    :param optimizer:
    :return:
    """
    exp_return = 0
    returns = []
    policy_loss = []
    # calculate discounted returns
    for reward in model.rewards[::-1]:
        exp_return = reward + 0.99 * exp_return
        returns.insert(0, exp_return)
    # multiply returns with saved log_probs (tensors) to use their gradient path
    for idx, _reward in enumerate(returns):
        policy_loss.append(_reward * model.saved_log_probs[idx])
    # add batch dimension, concatenate the list entries and sum them up
    # for a total loss
    summed_policy_loss = torch.cat([tmp.unsqueeze(0) for tmp in policy_loss]).sum()
    # actual weight update
    optimizer.zero_grad()
    summed_policy_loss.backward()
    optimizer.step()
    # empty saved rewards and saved log_probs
    model.saved_log_probs, model.rewards = [], []


def train(render=False):
    """
    Major train routine
    :param render: whether or not to render the environment
    """
    # create environment
    env = gym.make('CartPole-v0')
    env.seed(42)
    # create device (run on GPU if possible)
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    # create optimizer and model (and push model to according device)
    policy = Policy().to(device)
    optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)
    # iterate through episodes (specify max_episodes here)
    episode = 1
    max_episodes = None
    while episode:
        # get current state from environment
        state = env.reset()
        # play a sequence (maximum of 10000 actions)
        for t in range(10000):
            # create tensor from state and push it to same device as model
            state_tensor = torch.from_numpy(state).to(torch.float).to(device)
            # select the action for each state
            action = select_action(policy, state_tensor)
            # execute action, get reward, new state and whether the sequence
            # can be continued (whether pole did not topple over)
            state, reward, done, _ = env.step(action)
            # render the environment if necessary
            if render:
                env.render()
            # save reward for current state
            policy.rewards.append(reward)
            # breaking condition (break if pole toppled over)
            if done:
                break
        # update model by previous rewards and log_probs (saved in model)
        update_model(policy, optimizer)
        # optional: print weights of the network's first layer to see if
        # parameters changed (if they change the gradient path is okay)
        # print(policy.network[0].weight)
        # breaking condition for number of episodes
        if max_episodes is not None and episode >= max_episodes:
            break
        # move to next episode
        episode += 1


if __name__ == '__main__':
    train(True)
```
EDIT: Just noticed that the code is pretty similar to the one which is given as example in the pytorch repo. But I hope the explanations are helpful.
Thank you so much for the time you have taken to write all the code; I appreciate it a lot. The question I still have is: why doesn't my code calculate the gradients? Why does the code written below work while the way I wrote it didn't, especially when the policy loss is the same for both of them? Thank you
```python
# multiply returns with saved log_probs (tensors) to use their gradient path
for idx, _reward in enumerate(returns):
    policy_loss.append(_reward * model.saved_log_probs[idx])
```
The difference is that you only saved the data and not the tensors themselves. The gradient path is, however, stored in the tensor object, and thus saving only the data and creating a new tensor from it is not sufficient, since the gradient path vanishes.
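You can see this directly by inspecting `grad_fn`, which records the gradient path. A minimal sketch (the variable names here are just for illustration):

```python
import torch

x = torch.ones(3, requires_grad=True)
log_probs = (x * 2).log()  # carries a grad_fn, i.e. the gradient path back to x

# re-creating a tensor from the *data* drops that path
rebuilt = torch.tensor(log_probs.detach().numpy())

print(log_probs.grad_fn is not None)  # True: the original tensor is connected
print(rebuilt.grad_fn)                # None: backward() cannot reach x from here
```

Calling `backward()` on anything built from `rebuilt` would never produce gradients for `x`, which matches the `None` grads you observed.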
ohhhhh. I got it. Thank you
So, I tried this code. The problem I am facing is that when I run it on the GPU, the code does not train at all. Can you please tell me why it is behaving that way? I tried it on the CPU (with, of course, a different optimizer but the same seeds) and it gives me the same answers, which is good, but on the GPU it does not work at all. Is there something I am missing? Shouldn't the result be the same, since the seed is fixed?
So you switched the optimizer between the CPU and the GPU version? Results with different optimizers are not exactly comparable. You should also note that the GPU is non-deterministic by default. You may switch that setting with
```python
torch.backends.cudnn.deterministic = True
```
after importing and seeding PyTorch (this will slow down your code a bit). Can you also try it with plain SGD (and exactly the same parameters) on both GPU and CPU and post the results?
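A full seeding setup could look roughly like this (`seed_everything` is just a hypothetical helper name, not a PyTorch function):

```python
import random
import numpy as np
import torch


def seed_everything(seed=42):
    # hypothetical helper: seed every RNG and force deterministic cuDNN kernels
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)           # safe no-op on CPU-only machines
    torch.backends.cudnn.deterministic = True  # deterministic cuDNN algorithms
    torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning


seed_everything(42)
a = torch.rand(3)
seed_everything(42)
b = torch.rand(3)
print(torch.equal(a, b))  # True: same seed gives the same numbers
```

Note that even with this, some CUDA ops remain non-deterministic in older PyTorch versions, so CPU and GPU runs may still not match bit-for-bit.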
I have used same optimizer for both of the models. Let me try what you have suggested. Again thank you so much for your effort. I appreciate it a lot
Seems to work. Maybe you need to train a bit longer if you use non-deterministic behavior, but in general it should converge too.