Policy Gradient in PyTorch

Version 1

y = episode_a.argmax(-1)   # episode_a is in shape [T, n_actions]
action_preds = self.net(ep_s)  # action_preds is logits before softmax
neg_log_like = self.loss_fn(action_preds, y) 
loss = torch.mean(r * neg_log_like)   # r is reward

Version 2

y = torch.tensor(episode_a, requires_grad=True)   # one-hot actions, shape [T, n_actions]
action_preds = model(ep_s)   # action_preds here are probabilities (softmax output), not logits
neg_log_like = -y * torch.log(action_preds)
loss = torch.sum(neg_log_like, 1).mean()

Versions 1 and 2 seem to produce the same loss value. The difference is that y does not require grad in version 1, while it does in version 2. But since this is essentially a supervised-learning-style backprop, y is the target and should not need requires_grad. I do not understand why version 1 cannot train the policy while version 2 can.
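For reference, here is a minimal, self-contained sketch of the equivalence I am assuming between the two formulations. It assumes self.loss_fn is nn.CrossEntropyLoss with reduction='none', that the version-2 model ends in a softmax, and it uses random stand-in data; the reward weighting is applied to both versions so that only the loss formulation differs:

import torch
import torch.nn.functional as F

T, n_actions = 5, 3
logits = torch.randn(T, n_actions, requires_grad=True)   # stand-in for self.net(ep_s)
episode_a = F.one_hot(torch.randint(n_actions, (T,)), n_actions).float()   # one-hot actions, [T, n_actions]
r = torch.randn(T)   # stand-in rewards

# Version 1: integer targets + per-step cross-entropy on logits
y1 = episode_a.argmax(-1)
nll_v1 = F.cross_entropy(logits, y1, reduction='none')
loss_v1 = torch.mean(r * nll_v1)

# Version 2: one-hot targets + manual -y * log(prob)
probs = F.softmax(logits, dim=-1)
nll_v2 = torch.sum(-episode_a * torch.log(probs), dim=1)
loss_v2 = torch.mean(r * nll_v2)   # same reward weighting applied for a fair comparison

print(torch.allclose(nll_v1, nll_v2))     # True: per-step values match
print(torch.allclose(loss_v1, loss_v2))   # True: final loss values match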

@smth May I have some help from you on this problem?

Sorry for opening this post. I have solved the problem; it was due to another issue that is not mentioned here.