Hi everyone,
Perhaps I am very much misunderstanding some of the semantics of loss.backward() and optimizer.step(). In the PyTorch example implementation of the REINFORCE algorithm, the finish_episode() function contains the following excerpt:
policy_loss = []  # initialized earlier in finish_episode()
for log_prob, R in zip(policy.saved_log_probs, returns):
    policy_loss.append(-log_prob * R)
optimizer.zero_grad()
policy_loss = torch.cat(policy_loss).sum()
policy_loss.backward()
optimizer.step()
In the REINFORCE algorithm described in the Reinforcement Learning book by Richard S. Sutton and Andrew G. Barto, the update of the parameter vector \theta is
\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla\pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)},
i.e., the parameter vector is updated in every step through gradient ascent.
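To make the per-step version concrete, here is a minimal sketch of that update loop, assuming a toy two-action linear-softmax policy and a hypothetical two-step trajectory (the policy, features, and returns are all made up for illustration); note that \nabla\pi / \pi = \nabla\log\pi, so the gradient of the log-probability is used directly:

```python
import torch

# Per-step REINFORCE update as written in the book: theta is updated
# after every time step t, using the gradient evaluated at the
# *current* theta_t (a toy illustration, not the PyTorch example code).
theta = torch.zeros(2, requires_grad=True)
alpha = 0.01

# hypothetical trajectory: (feature vector of the taken action, return G_t)
trajectory = [(torch.tensor([1.0, 0.0]), 2.0),
              (torch.tensor([0.0, 1.0]), 1.0)]

for x, G in trajectory:
    # toy linear-softmax policy over two actions with features x and 1 - x;
    # log_prob stands in for log pi(A_t | S_t, theta_t)
    logits = torch.stack([theta @ x, theta @ (1.0 - x)])
    log_prob = torch.log_softmax(logits, dim=0)[0]
    grad, = torch.autograd.grad(log_prob, theta)
    with torch.no_grad():
        theta += alpha * G * grad  # gradient *ascent*, one step per t
```

Each iteration recomputes the gradient at the freshly updated parameters, which is the point of contrast with the batched version above.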
Maybe I am wrong, but policy_loss.backward() appears to compute the gradient of the entire summed loss with respect to a single parameter vector \theta, and optimizer.step() then applies the accumulated sum of these per-time-step gradients as if \theta_t were the same for every t, which does not seem to be equivalent to the theoretical per-step update.
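A tiny numeric check of what I mean (toy stand-in losses, not the actual policy): by linearity of the gradient, calling backward() on the summed loss accumulates the sum of the per-term gradients, all evaluated at one fixed \theta.

```python
import torch

# Stand-ins for the per-step terms -log_prob * R, all built from the
# same theta; backward() on their sum accumulates the summed gradient.
theta = torch.tensor([1.0], requires_grad=True)
losses = [-(theta * 2.0), -(theta * 3.0)]

torch.stack(losses).sum().backward()
print(theta.grad)  # tensor([-5.]) == -(2 + 3), all at the initial theta
```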
Is there something that I am missing or not seeing clearly here?
Thank you for your help.