Hi everyone,

Perhaps I am very much misunderstanding some of the semantics of `loss.backward()` and `optimizer.step()`. In the PyTorch example implementation of the REINFORCE algorithm, we have the following excerpt from the `finish_episode()` function:

```
policy_loss = []
for log_prob, R in zip(policy.saved_log_probs, returns):
    policy_loss.append(-log_prob * R)       # one term per time step: -log pi(A_t|S_t) * G_t
optimizer.zero_grad()
policy_loss = torch.cat(policy_loss).sum()  # single scalar loss for the whole episode
policy_loss.backward()                      # one backward pass through all time steps
optimizer.step()                            # one parameter update per episode
```

In the REINFORCE algorithm described in the Reinforcement Learning book by Richard S. Sutton and Andrew G. Barto, the update for the parameter vector `\theta` is

```
\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)},
```

i.e., the parameter vector is updated at *every* time step through gradient ascent, as in the sketch below.
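
For concreteness, here is roughly what I imagine a literal per-step implementation would look like (hypothetical code, not the official example; I am assuming `states` and `actions` lists saved during the episode alongside `returns`):

```
# Hypothetical per-step variant: one gradient step per time step t,
# recomputing log pi(A_t | S_t, theta_t) under the *current* parameters
# before each update, so every update really uses theta_t.
for state, action, R in zip(states, actions, returns):
    optimizer.zero_grad()
    probs = policy(state)                  # forward pass with the current theta_t
    log_prob = torch.distributions.Categorical(probs).log_prob(action)
    loss = (-R * log_prob).sum()           # -G_t * log pi(A_t|S_t, theta_t)
    loss.backward()
    optimizer.step()                       # theta_{t+1} = theta_t + alpha * G_t * grad log pi
```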

Maybe I am wrong, but `policy_loss.backward()` appears to compute the gradients of all the terms in the loss with respect to a *single* parameter vector `\theta`, and then `optimizer.step()` essentially applies the sum of these per-step gradients, assuming that `\theta_t` is the same for all values of t, which does not seem to be equivalent to the theoretical update.

Is there something that I am missing or not seeing clearly here?

Thank you for your help.