Optimizer zero_grad() / step() only works outside of loop?

Here is a piece of code from my implementation of CartPole-v1.

It works, but only when optimizer.zero_grad() and optimizer.step() are called outside the loop; when they are inside the loop, the model does not learn. I don't quite understand this behavior. I have seen them called inside the loop in the official tutorial, though that is not RL.

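    optimizer.zero_grad()  # outside the loop: clear gradients once per episode
    for i in range(len(recorder.state_tape)):
        state = recorder.state_tape[i]
        action = recorder.action_tape[i]
        reward = recorder.reward_tape[i]

        probs = model(state)
        dist = Bernoulli(probs)
        loss = -dist.log_prob(action) * reward  # use original action
        loss.backward()  # gradients accumulate across steps
    optimizer.step()  # outside the loop: one update per episode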


The whole program is here,



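The difference between calling them inside or outside the loop is the difference between a batch of size 1 and a batch of size len(recorder.state_tape). With the calls outside the loop, the gradients from every step accumulate and a single update is applied for the whole episode; with the calls inside the loop, the model is updated after every individual step. So it is possible that, in your case, the larger batch is needed for training to be stable and for the model to learn.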
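To make the batch-size point concrete, here is a minimal sketch comparing the two placements. It is not the original program: the tensors `states`, `actions` and `rewards` are made-up stand-ins for the recorder tapes, and the one-layer model, the SGD optimizer and the learning rate are assumptions chosen only for illustration.

```python
import torch
from torch import nn
from torch.distributions import Bernoulli

torch.manual_seed(0)

# Hypothetical stand-ins for recorder.state_tape / action_tape / reward_tape.
states = torch.randn(5, 4)                       # 5 steps, 4-dim state
actions = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
rewards = torch.tensor([1.0, 0.9, 0.8, 0.7, 0.6])


def make_model():
    torch.manual_seed(1)  # identical initial weights for every variant
    return nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())


def step_loss(model, i):
    # Same per-step loss as in the question, reduced to a scalar.
    probs = model(states[i])
    return (-Bernoulli(probs).log_prob(actions[i]) * rewards[i]).sum()


# Variant A: zero_grad()/step() outside the loop -> one update per episode.
model_a = make_model()
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_a.zero_grad()
for i in range(len(states)):
    step_loss(model_a, i).backward()  # each call adds into .grad
grad_a = model_a[0].weight.grad.clone()
opt_a.step()

# The same gradient, obtained from a single summed loss over the episode.
model_s = make_model()
total = sum(step_loss(model_s, i) for i in range(len(states)))
total.backward()
grad_s = model_s[0].weight.grad.clone()

# Variant B: zero_grad()/step() inside the loop -> one update per step.
model_b = make_model()
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)
for i in range(len(states)):
    opt_b.zero_grad()
    step_loss(model_b, i).backward()
    opt_b.step()

same_gradient = torch.allclose(grad_a, grad_s, atol=1e-6)
same_weights = torch.allclose(model_a[0].weight, model_b[0].weight)
print("A matches summed-loss gradient:", same_gradient)
print("A and B end at the same weights:", same_weights)
```

Variant A is equivalent to one gradient step on the loss summed over the episode, while variant B takes a separate and much noisier step for every sample, which may be why the second placement fails to learn here.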