Is the PyTorch REINFORCE implementation correct?

Hi everyone,

Perhaps I am very much misunderstanding some of the semantics of loss.backward() and optimizer.step(). In the PyTorch example implementation of the REINFORCE algorithm, we have the following excerpt from the finish_episode() function:

for log_prob, R in zip(policy.saved_log_probs, returns):
    # Per-step loss term: -log pi(A_t | S_t, theta) * G_t
    policy_loss.append(-log_prob * R)
optimizer.zero_grad()
# Sum the per-step losses, then do a single backward pass and one optimizer step
policy_loss = torch.cat(policy_loss).sum()
policy_loss.backward()
optimizer.step()
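
For context, the excerpt assumes that policy.saved_log_probs was filled during the rollout and that returns holds the discounted return for each time step. A minimal sketch of what that bookkeeping might look like (the helper names and the gamma value are my own, not taken verbatim from the example):

import torch
from torch.distributions import Categorical

def select_action(policy, state):
    # Forward pass gives action probabilities; store the sampled action's
    # log-probability so the policy-gradient loss can be built later
    probs = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    m = Categorical(probs)
    action = m.sample()
    policy.saved_log_probs.append(m.log_prob(action))
    return action.item()

def compute_returns(rewards, gamma=0.99):
    # Discounted return G_t for every step, computed backwards over the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return torch.tensor(returns)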

In the REINFORCE algorithm described in Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, the update for the parameter vector \theta is

\theta_{t+1} = \theta_t + \alpha G_t (\nabla\pi(A_t|S_t,\theta_t) / \pi(A_t|S_t,\theta_t)),

i.e., the parameter vector is updated at every time step via gradient ascent.
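
For reference, taking a single gradient-descent step on the per-step loss -G_t \ln\pi(A_t|S_t,\theta_t), which is the -log_prob * R term in the code, recovers exactly this update via the identity \nabla\ln\pi = \nabla\pi / \pi:

\theta_{t+1} = \theta_t - \alpha \nabla_\theta [-G_t \ln\pi(A_t|S_t,\theta_t)] = \theta_t + \alpha G_t (\nabla\pi(A_t|S_t,\theta_t) / \pi(A_t|S_t,\theta_t)).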

Maybe I am wrong, but policy_loss.backward() appears to compute the gradient of every term in the summed loss with respect to a single parameter vector \theta, and optimizer.step() then applies the sum of those gradients as one update, effectively assuming that \theta_t is the same for all values of t. That does not seem to be equivalent to the theoretical update above.
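
To make my concern concrete, here is a toy check with a dummy parameter standing in for \theta (the tensors and weights are purely illustrative): backward() on the summed loss accumulates the gradient of every term, all evaluated at the same current value of the parameter.

import torch

theta = torch.tensor([1.0, 2.0], requires_grad=True)
# Three "per-step" losses, all built from the same current theta
losses = [(theta * w).sum() for w in (0.5, -1.0, 2.0)]

# One backward pass over the sum accumulates each term's gradient
torch.stack(losses).sum().backward()
print(theta.grad)  # tensor([1.5000, 1.5000]) = (0.5 - 1.0 + 2.0) per element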

Is there something that I am missing or not seeing clearly here?

Thank you for your help.

Actually, in real applications, whether you update your actor and critic at every step or only after collecting several episodes does not meaningfully affect the performance of the algorithm.

However, computing a backward pass per step can be very inefficient: such tiny updates cannot saturate the ALUs on your CPU or the CUDA cores on your GPU, so the computation becomes far too expensive. Batching is therefore the economical choice, even though it does not conform to the “theoretical model” of the REINFORCE algorithm.
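
For comparison, a per-step variant that follows the book's update more literally would redo the forward pass and take one optimizer step per time step. A rough sketch, assuming the raw state tensors and sampled actions (rather than the log-probabilities) were stored during the rollout; the function and variable names are mine, not from the example:

import torch
from torch.distributions import Categorical

def reinforce_per_step(policy, optimizer, states, actions, returns):
    # One gradient-ascent step per time step, as in the textbook update.
    # The log-probability is recomputed with the *current* parameters,
    # so each update sees the effect of all the previous ones.
    for state, action, G in zip(states, actions, returns):
        probs = policy(state.unsqueeze(0)).squeeze(0)   # action probabilities for S_t
        log_prob = Categorical(probs).log_prob(action)  # log pi(A_t | S_t, theta)
        loss = -log_prob * G                            # descent on -G*log pi == ascent on G*log pi
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Every iteration is a full forward and backward pass for a single time step, which is exactly the kind of tiny workload that leaves most of the hardware idle.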
