I am trying to train an agent on Pong by scaling the loss gradients with the rewards, but it is not learning anything.
I have not applied discounting, because I think it might not be appropriate for some problems, for example when producing a word sequence.
Here is my current implementation, a hook that multiplies the gradient by the rewards:

def update_grad(grad):
    # scale the gradient elementwise by the per-step rewards
    return torch.mul(grad, rewards_tensor)
What is the correct way to do this? I am still learning PyTorch and I am very new to RL.
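For reference, this is the kind of approach I have been reading about: instead of hooking the gradients, weight the log-probability loss by the rewards and let autograd scale the gradients. This is just a minimal toy sketch with placeholder names (policy, states, actions, rewards are made up, not my real Pong code), am I on the right track with something like this?

```python
# Minimal REINFORCE-style sketch (no discounting), with a toy linear policy.
# All names/shapes here are placeholders, not the real Pong setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Linear(4, 2)            # toy policy: 4 features -> 2 actions

states = torch.randn(5, 4)          # 5 fake observations
actions = torch.randint(0, 2, (5,)) # actions that were sampled during play
rewards = torch.tensor([1.0, -1.0, 1.0, 1.0, -1.0])

log_probs = torch.log_softmax(policy(states), dim=1)
chosen = log_probs[torch.arange(5), actions]  # log-prob of each taken action

# Weight the loss (not the raw gradients) by the rewards:
loss = -(chosen * rewards).mean()
loss.backward()  # gradients are now reward-scaled automatically
```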
Thanks a lot,