Replay buffer with policy gradient


I’d like to ask a simple question about training with policy gradients and experience replay.

When I am collecting experiences by interacting with an environment using a policy approximated by a neural network, should I turn off history tracking with no_grad, or should I leave it on and let the operations be tracked?

I think we only need to track gradients when we replay trajectories sampled from the buffer, but I am not so sure.

I’ve looked at a few code snippets, and none of them turned off history tracking when collecting experience trajectories to store in the replay buffer. In that case, all the accumulated operations would be backpropagated through, along with arbitrarily many sampled trajectories replayed as many times as we wish. To me, only the sampled trajectories should be used for training, not the rollouts performed just to fill the buffer.

Hope you got what I am asking from my description. I am new to PyTorch, so feel free to correct me if I got anything wrong.



Not sure I can speak for which repos you are referring to, but if you look at the official DQN tutorial, you can see that select_action (the function used during the collection of experience tuples) wraps its forward pass in torch.no_grad():

def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        # Greedy action: no_grad() disables history tracking, so no
        # autograd graph is built during experience collection.
        with torch.no_grad():
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        # Exploratory action: sample uniformly at random.
        return torch.tensor([[random.randrange(2)]], device=device, dtype=torch.long)
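To connect this back to the replay-buffer question: collection can run entirely under no_grad(), because training recomputes a *fresh* forward pass on the sampled batch, and it is that pass that builds the graph for backward(). Here is a minimal sketch of that pattern; all the names (the tiny linear policy_net, the fake rewards, the buffer layout) are illustrative assumptions, not code from the tutorial:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative tiny setup: a linear "policy" over 4-dim states, 2 actions.
policy_net = nn.Linear(4, 2)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)
buffer = deque(maxlen=1000)

# --- Collection: no graph is needed, so wrap the forward pass in no_grad ---
for _ in range(32):
    state = torch.randn(1, 4)
    with torch.no_grad():                   # nothing recorded for autograd
        action = policy_net(state).argmax(1, keepdim=True)
    reward = torch.randn(1)                 # fake reward, for illustration only
    buffer.append((state, action, reward))  # stored tensors carry no graph

# --- Replay / training: gradients come from a fresh forward pass ---
batch = random.sample(buffer, 16)
states = torch.cat([s for s, _, _ in batch])        # (16, 4)
actions = torch.cat([a for _, a, _ in batch])       # (16, 1)
rewards = torch.cat([r for _, _, r in batch])       # (16,)

q_values = policy_net(states).gather(1, actions)    # graph is built HERE
loss = ((q_values.squeeze(1) - rewards) ** 2).mean()

optimizer.zero_grad()
loss.backward()   # works fine: the graph comes from the replay-time pass
optimizer.step()
```

The stored experiences are just plain data; what matters for backpropagation is the forward pass you run on the sampled batch at training time.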