Here is a piece of code from my implementation of CartPole-v1.
It works, but only when `optimizer.zero_grad()` and `step()` are outside the loop; otherwise there is no learning. I don't quite understand this behavior. I have seen them used inside the loop in the official tutorial, though that one isn't RL.
```python
self.optimizer.zero_grad()
for i in range(len(recorder.state_tape)):
    state = recorder.state_tape[i]
    action = recorder.action_tape[i]
    reward = recorder.reward_tape[i]
    probs = model(state)
    dist = Bernoulli(probs)
    loss = -dist.log_prob(action) * reward  # use original action
    loss.backward()
self.optimizer.step()
```
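For clarity, here is a minimal self-contained sketch of the variant that shows no learning for me, with `zero_grad()`/`step()` moved inside the loop. The model, optimizer, and episode data below are stand-ins for my actual `model` and `recorder` tapes, not the real code from the gist:

```python
import torch
import torch.nn as nn
from torch.distributions import Bernoulli

# Stand-in policy network and optimizer (hypothetical, for illustration only)
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Fake episode data standing in for recorder.state_tape / action_tape / reward_tape
states = [torch.randn(4) for _ in range(5)]
actions = [torch.tensor(1.0) for _ in range(5)]
rewards = [1.0] * 5

# The variant in question: zero_grad() and step() inside the loop,
# so each timestep triggers its own parameter update.
for state, action, reward in zip(states, actions, rewards):
    optimizer.zero_grad()
    probs = model(state)
    dist = Bernoulli(probs)
    loss = -dist.log_prob(action) * reward  # use original action
    loss.backward()
    optimizer.step()
```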
The whole program is here:
https://gist.github.com/mtian2018/5dc5e69dda5666c4655676bac4dad996