I am trying to do sequence classification using a policy gradient method. As I understand it, the “reinforce” method handles the policy gradient part out of the box. What I need is to combine it with a binary cross-entropy loss.
How should I do the gradient step properly? What I have now is something like this:
for action, r in zip(policy.saved_actions, rewards):
    action.reinforce(r)
optimizer.zero_grad()
autograd.backward(policy.saved_actions + [bce_loss], [None for _ in policy.saved_actions] + [None])
where bce_loss is the binary cross-entropy loss for sequence classification and each action is a stochastic variable. Am I doing this correctly? One strange thing: if I do not call action.reinforce, I get no error during the backpropagation step. Shouldn’t an exception be raised, since no reward was assigned to the stochastic variable?
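For comparison, here is a minimal sketch of the same combined update written as a single scalar loss, assuming a PyTorch version that provides torch.distributions. The toy Policy model, the tensor shapes, and the reward and label values are all made up for illustration and are not my actual code; the policy gradient part is expressed as the usual surrogate loss, -log_prob * reward, and simply added to the BCE term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(0)


class Policy(nn.Module):
    """Hypothetical toy model: one action head and one classification head."""

    def __init__(self, n_in=4, n_actions=3):
        super().__init__()
        self.action_head = nn.Linear(n_in, n_actions)
        self.class_head = nn.Linear(n_in, 1)

    def forward(self, x):
        # Returns action logits and one classification logit per step.
        return self.action_head(x), self.class_head(x).squeeze(-1)


policy = Policy()
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

# Made-up data: 5 steps of features, binary labels, and per-step rewards.
x = torch.randn(5, 4)
target = torch.ones(5)
rewards = torch.tensor([1.0, 0.5, -0.2, 0.0, 1.0])

action_logits, class_logits = policy(x)
dist = Categorical(logits=action_logits)
actions = dist.sample()              # sampling itself is not differentiated
log_probs = dist.log_prob(actions)   # differentiable w.r.t. the logits

# Score-function (REINFORCE) surrogate loss plus the supervised BCE loss.
pg_loss = -(log_probs * rewards).sum()
bce_loss = F.binary_cross_entropy_with_logits(class_logits, target)
total_loss = pg_loss + bce_loss

optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```

In this form both terms go through one backward call, so forgetting the reward term just means pg_loss is missing from the sum rather than a stochastic node being left without a reward.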