Manual gradient computation vs backward()

I am currently implementing policy gradients in PyTorch. For reasons not relevant to the question, I cannot compute the gradients directly using backward() as follows (this code works perfectly fine):

n_episodes = len(states)
states = torch.tensor(np.array([state for episode in states for state in episode[:-1]])).float()
actions = torch.tensor(np.array([action for episode in actions for action in episode])).float()
advantages = torch.tensor(self.compute_advantages(rewards, normalize=True)).float()

std = torch.exp(self.log_std)
log_probs = torch.distributions.normal.Normal(self.forward(states), std).log_prob(actions).flatten()

loss = -torch.dot(log_probs, advantages)
loss.backward()


I would rather compute the gradients manually, one state at a time. I know this is much less computationally efficient, but that is not the point. As I understand it, the following code should work:

for i in range(len(actions)):
    state = states[i]
    action = actions[i]
    advantage = advantages[i]

    for name, param in self.named_parameters():
        std = torch.exp(self.log_std)
        dist = torch.distributions.normal.Normal(self.forward(torch.from_numpy(state).float()), std)
        param.grad -= grad(dist.log_prob(torch.from_numpy(action).float()), param)[0] * advantages[i]


However, the computed gradients are completely different from those obtained using .backward(). Did I get anything wrong?

Have you set the random seed before both experiments? If the two runs use different seeds, the random numbers they generate differ, and with different values you get different gradients.

Thanks! The random seeds of both PyTorch and Gym are fixed to the same value in both experiments.
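
With the seeds ruled out, it may help to confirm that the math behind the manual loop is sound: because the gradient is linear, the gradient of the batched loss should equal the sum of the per-state gradients of the log-probability, each weighted by its advantage and negated. Below is a minimal, framework-free sketch of that check. Everything in it is an assumption made for illustration and not taken from the code above: a hypothetical one-dimensional linear policy mean = w * s + b with a learnable log_std, made-up data, and the helper names log_prob, batched_loss, per_sample_grad, manual_grad and numeric_grad. The batched gradient is approximated with central finite differences rather than autograd.

```python
import math

def log_prob(a, mean, log_std):
    """Log-density of Normal(mean, exp(log_std)) evaluated at a."""
    std = math.exp(log_std)
    return -0.5 * ((a - mean) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)

def batched_loss(params, states, actions, advantages):
    """loss = -sum_i advantage_i * log_prob(action_i | state_i)."""
    w, b, log_std = params
    return -sum(adv * log_prob(a, w * s + b, log_std)
                for s, a, adv in zip(states, actions, advantages))

def per_sample_grad(params, s, a):
    """Analytic gradient of log_prob w.r.t. (w, b, log_std) for one sample."""
    w, b, log_std = params
    var = math.exp(2 * log_std)
    z = a - (w * s + b)          # residual between action and policy mean
    d_mean = z / var             # d log_prob / d mean
    return (d_mean * s, d_mean, z * z / var - 1.0)

def manual_grad(params, states, actions, advantages):
    """Accumulate -advantage_i * grad(log_prob_i), one state at a time."""
    grad = [0.0, 0.0, 0.0]       # start from zeros, like freshly zeroed .grad fields
    for s, a, adv in zip(states, actions, advantages):
        g = per_sample_grad(params, s, a)
        for k in range(3):
            grad[k] -= adv * g[k]
    return grad

def numeric_grad(params, states, actions, advantages, eps=1e-6):
    """Central finite-difference gradient of the batched loss."""
    grad = []
    for k in range(len(params)):
        hi, lo = list(params), list(params)
        hi[k] += eps
        lo[k] -= eps
        diff = (batched_loss(hi, states, actions, advantages)
                - batched_loss(lo, states, actions, advantages))
        grad.append(diff / (2 * eps))
    return grad

# Made-up data, purely for illustration.
states = [0.5, -1.0, 2.0]
actions = [0.3, 0.1, -0.7]
advantages = [1.0, -0.5, 2.0]
params = (0.8, -0.2, 0.1)        # (w, b, log_std)

manual = manual_grad(params, states, actions, advantages)
numeric = numeric_grad(params, states, actions, advantages)
print("manual :", manual)
print("numeric:", numeric)
```

If the two gradients agree here, as they should up to finite-difference error, the discrepancy in the PyTorch version is more likely to come from the implementation than from the idea itself. Things worth checking would be whether param.grad starts from zero before the loop, whether the same advantages are used in both versions, and whether the tensor shapes inside log_prob match between the batched and the per-state call.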