Pointer Networks with RL - problem with action.reinforce(r)

I’m working on an implementation of Neural Combinatorial Optimization with RL, and I got a bit stuck on the reinforce update for the pointer network.

Essentially, at each step of decoding I use torch.multinomial to select an element of the input, so the output is a permutation of the inputs (I call the elements the pointer network selects the “actions”). I then run the “actions” through a reward function and call action.reinforce(r) on each one. The problem, I think, is that the actions are not directly the result of the call to torch.multinomial (the indices I use to select the actions from the input are), so I get the error:

RuntimeError: reinforce() can be only called on outputs of stochastic functions
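Here is roughly what one decoding step looks like, with probs, inputs, and r standing in for my actual variables (this is the old, pre-0.4 stochastic-function API):

```python
# One decoding step, simplified (PyTorch 0.3-era stochastic autograd API).
idx = torch.multinomial(probs, 1)  # the indices are the actual stochastic output
action = inputs.gather(1, idx)     # the "action" is a deterministic gather

action.reinforce(r)  # fails: `action` is not the output of a stochastic function
# idx.reinforce(r)   # reinforce() is only defined on the multinomial output
```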

Is there any way to get around this so I can still use PyTorch’s reinforce method? Thanks in advance!

Update: I ditched the reinforce method and now simply compute the loss as logprobs * (reward - baseline), averaged over the batch size B, where logprobs is the sum of the log probabilities of the pointer network’s outputs. The reward is computed from the actions themselves, not from the action indices, which are the outputs of torch.multinomial.
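In case it helps anyone, here is a minimal sketch of that loss. The tensor names and shapes are simplified stand-ins, not my exact code:

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, idx, reward, baseline):
    """Manual REINFORCE loss: mean over the batch of logprobs * (reward - baseline).

    logits:   [T, B, N] pointer scores over the N inputs at each of T decode steps
    idx:      [T, B]    sampled indices (the outputs of torch.multinomial)
    reward:   [B]       reward computed from the gathered actions
    baseline: [B]       baseline estimate (e.g. an exponential moving average)
    """
    log_probs = F.log_softmax(logits, dim=-1)                     # [T, B, N]
    chosen = log_probs.gather(-1, idx.unsqueeze(-1)).squeeze(-1)  # [T, B]
    sum_logprobs = chosen.sum(dim=0)                              # [B]
    # Detach so the gradient only flows through the log-probability term.
    advantage = (reward - baseline).detach()
    return (sum_logprobs * advantage).mean()
```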

Maybe this is helpful: I had a similar problem.
