Hi guys,
I’m confused about `backward()` in REINFORCE.
This is my implementation:
```python
policy_loss = []
for sample, rew in zip(self.policy_history, reward):
    # negative log-probability weighted by the return; clamp avoids log(0)
    loss = torch.sum(-torch.log(sample.clamp(min=1e-6)) * rew, -1)
    policy_loss.append(loss)
policy_loss = torch.stack(policy_loss).sum() / batch_size
policy_loss.backward()
```
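For context, here is a stripped-down, runnable version of the whole update as I understand it. The toy `logits` / `rewards` tensors and the Adam optimizer are just stand-ins for my real policy network and collected returns:

```python
import torch

torch.manual_seed(0)

# stand-ins: 3 timesteps, 4 actions; in my real code `self.policy_history`
# holds the probabilities my policy produced and `reward` holds the returns
logits = torch.randn(3, 4, requires_grad=True)
probs = torch.softmax(logits, dim=-1)
rewards = torch.tensor([1.0, 0.5, 2.0])
batch_size = len(rewards)

optimizer = torch.optim.Adam([logits], lr=1e-2)

policy_loss = []
for sample, rew in zip(probs, rewards):
    loss = torch.sum(-torch.log(sample.clamp(min=1e-6)) * rew, -1)
    policy_loss.append(loss)
policy_loss = torch.stack(policy_loss).sum() / batch_size

optimizer.zero_grad()
policy_loss.backward()
optimizer.step()
```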
But isn’t REINFORCE supposed to be optimized with gradient ascent, while my optimizer applies gradient descent?
I’m worried that `optimizer.step()` is trying to drive `policy_loss` to zero, which would make my rewards smaller and smaller.
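To reason about the sign, I tried this toy check (all numbers made up): with a single positive reward, the gradient of the loss at the rewarded action comes out negative, so a descent step should push that action’s probability up. Is that the right way to think about it?

```python
import torch

p = torch.tensor([0.25, 0.25, 0.25, 0.25], requires_grad=True)
rew = 2.0

# loss for one "chosen" action (index 2) with a positive reward
loss = -torch.log(p[2].clamp(min=1e-6)) * rew
loss.backward()

print(p.grad)  # tensor([ 0.,  0., -8.,  0.]) -> a descent step increases p[2]
```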
And what would change in my code if I used negative rewards, i.e. penalties?