I saw this training code in two different sources (an open source project and a former colleague's code), so I always assumed it was correct and used it.
```python
ratios = torch.exp(torch.clamp(a_logp_new - torch.sum(a_logp[idx], dim=-1), max=4))
surr1 = ratios * adv[idx]
surr2 = torch.clamp(ratios, 1 - self.clip_ep, 1 + self.clip_ep) * adv[idx]
actor_loss = - torch.min(surr1, surr2).mean()
```
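For context, here is a minimal, self-contained sketch of the same clipped-surrogate computation, assuming `a_logp_new` and the summed `a_logp[idx]` each reduce to one log-probability per sample; the tensor values and the name `a_logp_old` below are just placeholders, not the actual training data:

```python
import torch

# Placeholder inputs; in the real code these come from the rollout buffer / policy network.
a_logp_new = torch.tensor([-0.5, -1.2, -0.3])   # log-prob of actions under the current policy
a_logp_old = torch.tensor([-0.7, -1.0, -0.9])   # log-prob of the same actions under the old policy
adv        = torch.tensor([ 1.0, -1.0,  0.5])   # advantage estimates
clip_ep    = 0.1

# Probability ratio pi_new / pi_old, clamped in log-space so exp() cannot overflow.
ratios = torch.exp(torch.clamp(a_logp_new - a_logp_old, max=4))

# PPO clipped surrogate: take the elementwise minimum of the unclipped and clipped terms.
surr1 = ratios * adv
surr2 = torch.clamp(ratios, 1 - clip_ep, 1 + clip_ep) * adv
actor_loss = -torch.min(surr1, surr2).mean()
print(actor_loss)
```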
But now I'm wondering whether it is wrong when the advantage is negative.
Assume clip_ep is 0.1 and the ratio is 2. When adv is positive, the loss is -1.1 * adv; when adv is negative, the loss is -2 * adv.
So positive and negative advantages with the same absolute value have an asymmetric influence on the loss.
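To make the asymmetry concrete, here is a quick check of that case with placeholder scalars (clip_ep = 0.1, ratio = 2, adv = ±1):

```python
import torch

clip_ep = 0.1
ratio = torch.tensor(2.0)

for adv in (torch.tensor(1.0), torch.tensor(-1.0)):
    surr1 = ratio * adv                                            # unclipped term
    surr2 = torch.clamp(ratio, 1 - clip_ep, 1 + clip_ep) * adv     # clipped term
    loss = -torch.min(surr1, surr2)
    print(f"adv={adv.item():+.1f} -> loss={loss.item():+.2f}")

# adv=+1.0 -> loss=-1.10   (the clipped term wins the min)
# adv=-1.0 -> loss=+2.00   (the unclipped term wins the min)
```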
And I guess this is the reason my Categorical distribution becomes NaN, as in this discussion (Categorical distribution returning breaking).