I am currently implementing policy gradients in PyTorch. For reasons not relevant to this question, I cannot compute the gradients directly with backward(). As a reference, though, the following code works perfectly fine:
```python
n_episodes = len(states)
states = torch.tensor(np.array([state for episode in states for state in episode[:-1]])).float()
actions = torch.tensor(np.array([action for episode in actions for action in episode])).float()
advantages = torch.tensor(self.compute_advantages(rewards, normalize=True)).float()

std = torch.exp(self.log_std)
log_probs = torch.distributions.normal.Normal(self.forward(states), std).log_prob(actions).flatten()
loss = -torch.dot(log_probs, advantages)
loss.backward()
self.optimizer.step()
```
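For context, both snippets live inside a policy module roughly like the sketch below. The architecture and layer sizes are hypothetical and only for illustration; the parts that matter for the question are `log_std`, `forward()`, and `self.optimizer` (`compute_advantages` is omitted for brevity):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        # hypothetical architecture, for illustration only
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim)
        )
        # log of the state-independent standard deviation of the Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)

    def forward(self, states):
        # the network outputs the mean of the Gaussian action distribution
        return self.net(states)
```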
I would instead like to compute the gradients manually, state by state. I know this is much less computationally efficient, but that is not the point. In my understanding, the following code should be equivalent:
```python
for i in range(len(actions)):
    state = states[i]
    action = actions[i]
    advantage = advantages[i]
    for name, param in self.named_parameters():
        std = torch.exp(self.log_std)
        dist = torch.distributions.normal.Normal(
            self.forward(torch.from_numpy(state).float()), std
        )
        param.grad -= grad(dist.log_prob(torch.from_numpy(action).float()), param) * advantage
self.optimizer.step()
```
However, the gradients it computes are completely different from those obtained with .backward(). Did I get anything wrong?
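For reference, this is roughly how I compare the two variants (a sketch; `policy` stands for an instance of the module above, and both runs start from the same saved weights, with `optimizer.step()` left out):

```python
# Save the initial weights so both variants see identical parameters.
init = {k: v.clone() for k, v in policy.state_dict().items()}

policy.load_state_dict(init)
# ... run the backward() version here (without optimizer.step()) ...
grads_backward = {n: p.grad.clone() for n, p in policy.named_parameters()}

policy.load_state_dict(init)
# Zero the gradients in place rather than setting them to None,
# so the `param.grad -= ...` accumulation in the manual loop works.
policy.optimizer.zero_grad(set_to_none=False)
# ... run the manual loop here (without optimizer.step()) ...
for n, p in policy.named_parameters():
    # largest absolute difference between the two gradients, per parameter
    print(n, (p.grad - grads_backward[n]).abs().max().item())
```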