The difference between actor-critic example and A2C?

In the PyTorch examples repo there is example code for actor-critic. I am curious: what is the difference between this code and A2C? It also learns a baseline for the advantage estimate, which sounds like what A2C does.

for (log_prob, value), r in zip(saved_actions, rewards):
    reward = r - value.data[0]  # return minus the critic's value estimate
    policy_losses.append(-log_prob * reward)
    value_losses.append(F.smooth_l1_loss(value, Variable(torch.Tensor([r]))))

So in the current code, if you look at how the loss is calculated, you'd see that the critic is taught to learn the reward of a particular state-action pair, given by r, i.e. the action-value function, rather than the advantage function. In A2C, the critic would learn the advantage function.


Actor-critic is a family of methods; advantage actor-critic (A2C) is one implementation of that family.
I don't think what you said is quite right. The reward in the code you posted is actually the advantage function. In the latest version of the code, reward has been renamed to advantage. The advantage function can be written as A(s, a) = Q(s, a) - V(s), which corresponds to the line: advantage = r - value.data[0]
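To make the point concrete, here is a minimal plain-Python sketch (no autograd, hypothetical numbers) of what the loop above computes: the quantity multiplying the log-probability is the return minus the critic's value estimate, i.e. an advantage estimate, while the critic itself is regressed toward the return.

```python
# Hypothetical per-step quantities collected during an episode.
returns = [1.0, 0.5, 0.2]        # discounted returns R for each step
values = [0.8, 0.6, 0.1]         # critic's value estimates V(s)
log_probs = [-0.1, -0.7, -0.3]   # log pi(a|s) for the actions taken

# Advantage estimate: A(s, a) ~ R - V(s). This mirrors
# `reward = r - value.data[0]` in the example code.
advantages = [r - v for r, v in zip(returns, values)]

# Policy loss term: -log_prob * advantage (advantage is treated as a
# constant with respect to the actor's parameters).
policy_losses = [-lp * a for lp, a in zip(log_probs, advantages)]

# Critic loss: regress V(s) toward the return R (plain squared error
# here; the example uses smooth L1).
value_losses = [(r - v) ** 2 for r, v in zip(returns, values)]

print(advantages)
print(policy_losses)
```

So even though the variable was named reward, the policy gradient is weighted by an advantage estimate, which is why the snippet is essentially A2C-style; only the critic's regression target is the return rather than the advantage itself.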
