For actor-critic methods, should the calculation of the target value be involved in backpropagation?

I am confused about how the target value is computed in Q-learning and actor-critic methods. During backpropagation, is the value network involved twice, since we call it twice: once for the current state value and once for the next state value? Should I put with torch.no_grad() before calculating the next state value?
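
For reference, the target value I mean is the usual one-step bootstrap target (just a sketch, where gamma is the discount factor and V is the critic):

        target_value = reward + gamma * V(new_state) * (1 - done)
        advantage    = target_value - V(current_state)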

        # sample an action from the actor and get the critic's value of the current state
        action, log_prob = actor.forward(current_state)
        current_value = critic.forward(current_state)

        action = action.detach().cpu().numpy()

        new_state, reward, done, info = env.step(action)

        reward = torch.from_numpy(np.array(reward)).type(torch.FloatTensor).to(device)
        new_state = torch.from_numpy(new_state).type(torch.FloatTensor).to(device)

        # should I put with torch.no_grad() here so the next state value
        # is treated as a constant target?
        with torch.no_grad():
            next_value = critic.forward(new_state)
       
        # one-step TD target; (1 - done) drops the bootstrap term at episode end
        target_value = reward + gamma * next_value * (1 - int(done))
        advantage = target_value - current_value

        actor_loss = -1 * log_prob * advantage
        critic_loss = advantage ** 2
        loss = actor_loss + critic_loss

        actor_optimizer.zero_grad()
        critic_optimizer.zero_grad()
        loss.backward()
        actor_optimizer.step()
        critic_optimizer.step()
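
In case it helps, this is the alternative I am considering: instead of the no_grad block, detaching next_value after the forward pass (I assume the two have the same effect on the gradient):

        # detach() also stops gradients from flowing into the critic through
        # the target, same gradient-wise as computing it under no_grad
        next_value = critic.forward(new_state).detach()
        target_value = reward + gamma * next_value * (1 - int(done))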