# Is the PyTorch REINFORCE implementation correct?

Hi everyone,

Perhaps I am very much misunderstanding some of the semantics of loss.backward() and optimizer.step(). In the PyTorch example implementation of the REINFORCE algorithm, we have the following excerpt from the finish_episode() function:

```python
for log_prob, R in zip(policy.saved_log_probs, returns):
    policy_loss.append(-log_prob * R)
policy_loss = torch.cat(policy_loss).sum()
policy_loss.backward()
optimizer.step()
```
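
For anyone who wants to poke at the excerpt in isolation, here is a minimal self-contained version I put together; the tiny linear policy, the fake episode data, and the hyperparameters are my own stand-ins rather than anything from the example (I have also kept the optimizer.zero_grad() call that the full example makes before backward()):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Tiny stand-in policy, optimizer, and episode data (all hypothetical)
# so the excerpt above can be run end to end.
policy = nn.Linear(4, 2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Fake one episode: sample actions and record their log-probabilities.
saved_log_probs = []
for _ in range(5):
    probs = torch.softmax(policy(torch.randn(1, 4)), dim=-1)
    dist = Categorical(probs)
    action = dist.sample()
    saved_log_probs.append(dist.log_prob(action))
returns = torch.randn(5)  # stand-in for the normalized discounted returns G_t

# The pattern from the excerpt.
policy_loss = []
for log_prob, R in zip(saved_log_probs, returns):
    policy_loss.append(-log_prob * R)
optimizer.zero_grad()
policy_loss = torch.cat(policy_loss).sum()
policy_loss.backward()
optimizer.step()
```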


In the REINFORCE algorithm described in *Reinforcement Learning: An Introduction* by Richard S. Sutton and Andrew G. Barto, the update of the parameter vector \theta is

\theta_{t+1} = \theta_t + \alpha G_t (\nabla\pi(A_t|S_t,\theta_t) / \pi(A_t|S_t,\theta_t)),


i.e., the parameter vector is updated at every time step t via gradient ascent.
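
To spell out how I am mapping the code onto this formula, using the standard log-derivative identity

\nabla_\theta \log\pi(A_t|S_t,\theta) = \nabla_\theta\pi(A_t|S_t,\theta) / \pi(A_t|S_t,\theta),

my understanding is that the gradient of the summed loss in the code is

\nabla_\theta \sum_t ( -G_t \log\pi(A_t|S_t,\theta) ) = -\sum_t G_t ( \nabla_\theta\pi(A_t|S_t,\theta) / \pi(A_t|S_t,\theta) ),

so a single optimizer.step() with learning rate \alpha applies the one batched update

\theta \leftarrow \theta + \alpha \sum_t G_t ( \nabla_\theta\pi(A_t|S_t,\theta) / \pi(A_t|S_t,\theta) ),

with every term evaluated at the same \theta.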

Maybe I am wrong, but policy_loss.backward() appears to compute the gradient of every term in the loss with respect to a single parameter vector \theta, and optimizer.step() then essentially applies the sum of these gradients, assuming that \theta_t is the same for all t, which does not seem to be equivalent to the theoretical update.
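
To make the distinction I am asking about concrete, here is a toy sketch I wrote (entirely my own construction, not from the example): scheme (a) takes one summed update as the example does, scheme (b) takes one update per time step as I read the book's pseudocode. The toy log-probability and the random stand-in data are hypothetical.

```python
import torch

torch.manual_seed(0)
xs = torch.randn(5)  # stand-ins for per-step observations (hypothetical)
Gs = torch.rand(5)   # stand-ins for per-step returns G_t (hypothetical)

def log_prob(theta, x):
    # Toy log-probability whose gradient depends on theta, so the two
    # schemes below do not coincide exactly.
    return torch.log(torch.sigmoid(theta * x))

# (a) One summed update, as in the example: every gradient is evaluated at
# the same theta, then a single optimizer step is taken.
theta_a = torch.zeros(1, requires_grad=True)
opt_a = torch.optim.SGD([theta_a], lr=0.1)
loss = torch.stack([-log_prob(theta_a, x) * G for x, G in zip(xs, Gs)]).sum()
opt_a.zero_grad()
loss.backward()
opt_a.step()

# (b) One update per time step, the literal reading of the book's pseudocode:
# the gradient for the next step is evaluated at the already-updated theta.
theta_b = torch.zeros(1, requires_grad=True)
opt_b = torch.optim.SGD([theta_b], lr=0.1)
for x, G in zip(xs, Gs):
    loss_t = -log_prob(theta_b, x) * G
    opt_b.zero_grad()
    loss_t.backward()
    opt_b.step()

print(theta_a.item(), theta_b.item())  # similar, but not identical in general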

Is there something that I am missing or not seeing clearly here?