Is this PPO training code wrong?

I saw this training code in two different sources (an open-source project and code from a former colleague). I always assumed it was correct and used it.

# probability ratio pi_new / pi_old, clamped in log space before the exp
ratios = torch.exp(torch.clamp(a_logp_new - torch.sum(a_logp[idx], dim=-1), max=4))
surr1 = ratios * adv[idx]
surr2 = torch.clamp(ratios, 1 - self.clip_ep, 1 + self.clip_ep) * adv[idx]
# PPO clipped surrogate objective (negated, since we minimize)
actor_loss = -torch.min(surr1, surr2).mean()

But now I'm wondering whether it is wrong when the advantage is negative.
Assume clip_ep is 0.1 and the ratio is 2: when adv is positive, the loss is -1.1 * adv; when adv is negative, the loss is -2 * adv.
So positive and negative advantages with the same absolute value have an asymmetric influence on the loss,
and I guess this is the reason my Categorical distribution becomes NaN, like in this discussion (Categorical distribution returning breaking).
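A quick sanity check of those numbers (assuming clip_ep = 0.1, ratio = 2, and |adv| = 1, plugged into the loss above):

import torch

clip_ep = 0.1
ratio = torch.tensor(2.0)

for adv in (1.0, -1.0):
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_ep, 1 + clip_ep) * adv
    loss = -torch.min(surr1, surr2)
    print(f"adv={adv:+.1f}  surr1={surr1.item():+.2f}  "
          f"surr2={surr2.item():+.2f}  loss={loss.item():+.2f}")

# adv=+1.0  surr1=+2.00  surr2=+1.10  loss=-1.10
# adv=-1.0  surr1=-2.00  surr2=-1.10  loss=+2.00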

If your advantage is negative, you want to push the parameters away from that outcome, so everything should work fine. The min operator ensures that you're taking the most pessimistic estimate of the surrogate objective: clipping only removes the incentive to move the ratio far outside [1 - clip_ep, 1 + clip_ep] when that would improve the objective, but it never caps the penalty when the update makes things worse, so the asymmetry you observed is intentional.
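Here is a minimal sketch of that point with toy values (not taken from your code): when the ratio is already outside the clip range, the gradient vanishes only on the side where the objective would improve, never on the side that pushes the policy back.

import torch

clip_ep = 0.1
logp_old = torch.tensor(0.0)

for adv in (1.0, -1.0):
    logp_new = torch.tensor(0.7, requires_grad=True)  # ratio = exp(0.7) ~ 2.0, outside the clip range
    ratio = torch.exp(logp_new - logp_old)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_ep, 1 + clip_ep) * adv
    loss = -torch.min(surr1, surr2)
    loss.backward()
    print(f"adv={adv:+.1f}  grad wrt logp_new = {logp_new.grad.item():+.3f}")

# adv=+1.0  grad wrt logp_new = +0.000   -> clipped: no incentive to push the ratio further up
# adv=-1.0  grad wrt logp_new = +2.014   -> unclipped: the policy is still pushed away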