Understanding backward in reinforce

Hi guys,
I’m confused about how backward() works in REINFORCE.
This is my implementation:

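 policy_loss = []
 for sample, rew in zip(self.policy_history, reward):
     # -log(prob) weighted by the reward, summed over the last dimension
     loss = torch.sum(torch.mul(-torch.log(sample.clamp(min=1e-6)), rew), -1)
     policy_loss.append(loss)

 # sum the collected losses and divide by the batch size
 policy_loss = torch.stack(policy_loss).sum() / batch_size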

But isn’t REINFORCE supposed to be optimized with gradient ascent, while my optimizer applies gradient descent?
I’m worried that optimizer.step() is trying to bring policy_loss to zero, making my rewards smaller and smaller.
And what changes in my code if I use negative rewards, such as penalties?
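
For what it’s worth, the sign question can be written out as a short derivation. This is only a sketch in the usual policy-gradient notation, none of which appears in the post: θ are the network parameters, π_θ(a_t | s_t) is the probability of the action taken, R_t is the reward weight (treated as a constant, with no gradient flowing through it), J is the expected reward and α is the learning rate.

$$
\nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_t R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big],
\qquad
L(\theta) = -\,\mathbb{E}\Big[\sum_t R_t \, \log \pi_\theta(a_t \mid s_t)\Big]
$$

$$
\nabla_\theta L(\theta) = -\nabla_\theta J(\theta)
\quad\Longrightarrow\quad
\theta - \alpha \nabla_\theta L(\theta) = \theta + \alpha \nabla_\theta J(\theta)
$$

Under these assumptions, a descent step on L is the same update as an ascent step on J, and the minus sign in front of the log-probability is what flips the direction. Nothing in the derivation requires R_t to be positive.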

Check this:
What is the difference between backpropagation and reinforcement learning, in training artificial neural networks? Are the two techniques completely different or related?

Thanks, I understand the differences between reinforcement learning and backpropagation, but my question is: how can minimizing a positive policy_loss maximize the rewards?
My policy loss is sum(-logprob*rew), so it’s a positive quantity.
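
One way to check this numerically is a toy example small enough to differentiate by hand. The sketch below is plain Python rather than PyTorch, and everything in it is a made-up illustration, not the code from the post: a single logit `theta`, `p = sigmoid(theta)` as the probability of the action that was taken, and a fixed `reward`. The update line uses the same rule as plain SGD (`theta <- theta - lr * grad`).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(reward, steps=50, lr=0.5):
    """Gradient DESCENT on loss = -log(p) * reward for one fixed action.

    Hypothetical toy setup: `theta` is the only parameter, p = sigmoid(theta)
    is the probability of the action taken, `reward` is a constant weight.
    Returns the (p, loss) pair recorded before each update.
    """
    theta = 0.0  # p starts at 0.5
    history = []
    for _ in range(steps):
        p = sigmoid(theta)
        loss = -math.log(p) * reward
        history.append((p, loss))
        # d(log p)/d(theta) = 1 - p, so d(loss)/d(theta) = -(1 - p) * reward
        grad = -(1.0 - p) * reward
        theta -= lr * grad  # descent step: theta <- theta - lr * grad
    return history


pos = train(reward=+1.0)
neg = train(reward=-1.0)

for name, hist in (("reward=+1", pos), ("reward=-1", neg)):
    (p0, l0), (p1, l1) = hist[0], hist[-1]
    print(f"{name}: p {p0:.3f} -> {p1:.3f}, loss {l0:.3f} -> {l1:.3f}")
```

In this toy case the loss goes down in both runs, but the effect on the policy differs: with a positive reward the probability of the taken action rises (the loss approaches zero because p approaches 1, not because the reward shrinks), and with a negative reward the probability falls while the loss becomes more and more negative. The reward only scales and signs the gradient; the optimizer never changes it.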