I'm trying to implement a policy gradient method in RL, and the goal of my NN is to increase the probability of actions that lead to positive results and decrease the probability of actions that lead to negative results.
The loss function is something like `A * -log p`, where A can be positive or negative. However, if the agent samples an action that leads to a negative A, the loss becomes negative. In that case I actually want the negative loss to go 'lower', because then the probability of that 'bad' action would decrease. But it didn't work as I imagined.
Here is part of my code:
```python
epprob = output.gather(1, epaction)       # probability of each sampled action
loglik = torch.log(epprob)                # log-likelihood of those actions
loss = -(loglik * discounted_epr).mean()  # REINFORCE-style loss, A = discounted return
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
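To illustrate what confuses me, here is a minimal toy check I put together (the names `logits`, `action`, and `advantage` are made up for this sketch, not from my actual network): with `loss = -(log p * A)`, a negative A gives a negative loss value, yet a gradient step should still lower that action's probability.

```python
import torch

# Hypothetical toy setup: 3 actions, uniform initial policy.
torch.manual_seed(0)
logits = torch.zeros(1, 3, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.5)

action = torch.tensor([[1]])        # the sampled "bad" action
advantage = torch.tensor([[-1.0]])  # negative return A

prob_before = torch.softmax(logits, dim=1).gather(1, action).item()

probs = torch.softmax(logits, dim=1)
logp = torch.log(probs.gather(1, action))
loss = -(logp * advantage).mean()   # loss value is negative here

optimizer.zero_grad()
loss.backward()
optimizer.step()

prob_after = torch.softmax(logits, dim=1).gather(1, action).item()
print(loss.item() < 0, prob_after < prob_before)  # True True
```

So the loss being negative and the bad action becoming less likely can happen at the same time, which is what makes my situation puzzling.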
What should I do now? Are there any tutorials about how to deal with a negative loss, or about loss functions involving probabilities? I really appreciate your help.