How to rewrite REINFORCE without using .reinforce()?

According to the REINFORCE formula, we can define the loss function as the log-probability of the actions multiplied by the rewards. Is it possible to replace action.reinforce(r) with loss = outputs of final layer * rewards?

There’s an implementation here that doesn’t use .reinforce().

Basically, what you wrote is mostly correct. However, you need to sample from the categorical output (i.e. the softmax output of the final layer) to obtain an action, then use that action to index the categorical output and take the log. You can then multiply the log-probability by the reward and sum it up as you wrote; the loss is the negative of that product, since the optimizer minimizes. The code snippets I’m talking about are below:

probs = self.model(Variable(state)) # run net, get categorical probs output
action = probs.multinomial().data # sample to get action index
prob = probs[:, action[0,0]].view(1, -1) # index probs with action selection
log_prob = prob.log() # compute log prob

ith_step_loss = -log_prob*reward
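
For readers on a newer PyTorch release, where Variable and the no-argument multinomial() call are no longer needed, the same single step can be sketched with torch.distributions.Categorical. The policy network, state size, and reward value below are placeholders made up for illustration; they are not taken from the linked implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical policy: 4-dimensional state, 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))

state = torch.randn(1, 4)   # placeholder state
reward = 1.0                # placeholder reward for this step

probs = policy(state)               # categorical probabilities, shape (1, 2)
dist = Categorical(probs=probs)
action = dist.sample()              # sampled action index, shape (1,)
log_prob = dist.log_prob(action)    # log pi(action | state), shape (1,)

step_loss = (-log_prob * reward).sum()   # same quantity as ith_step_loss
step_loss.backward()                     # gradients flow into the policy parameters
```

Sampling itself is not differentiated here; the gradient comes only through log_prob, which is what the REINFORCE estimator needs.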

You can see this loss (-log_prob*reward) implemented on line 52 of the linked implementation, except that there they sum the loss across the entire trajectory, whereas I’ve written it above for a single step’s action and reward.
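
Summing over a trajectory might look roughly like the sketch below. It assumes a recent PyTorch release; the policy, the fixed random states, and the reward list are hypothetical stand-ins for a real environment rollout, where each action would normally determine the next state.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical policy and optimizer, for illustration only.
policy = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.SGD(policy.parameters(), lr=0.01)

# Placeholder trajectory: 5 random states with made-up rewards.
states = torch.randn(5, 4)
rewards = [1.0, 0.0, 0.5, 1.0, -1.0]

log_probs = []
for state in states:
    dist = Categorical(probs=policy(state.unsqueeze(0)))
    action = dist.sample()
    log_probs.append(dist.log_prob(action))   # keep the graph for backward()

# Sum the per-step losses over the whole trajectory, then update once.
trajectory_loss = torch.stack(
    [-lp * r for lp, r in zip(log_probs, rewards)]
).sum()

optimizer.zero_grad()
trajectory_loss.backward()
optimizer.step()
```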


Can we use the NLLLoss() function weighted by the reward in this case (if I want to train offline, in batches)?

Yes, you can use a batch of -logprobs*reward values in batch-update mode, instead of summing up the loss as above. The torch.distributions package in the current PyTorch 0.3 release can compute log-probabilities for a batch of data.
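
As a rough sketch of the batch case, the two routes (Categorical.log_prob on a whole batch, and a per-sample NLL loss weighted by reward) give the same loss. This assumes a recent PyTorch release, where the per-sample option is spelled reduction="none". The logits, actions, and rewards are random placeholders; in practice the logits would come from the policy network evaluated on the stored states.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(0)

# Placeholder offline batch: 8 steps, 3 possible actions.
logits = torch.randn(8, 3, requires_grad=True)   # stand-in for network outputs
actions = torch.randint(0, 3, (8,))              # actions that were taken
rewards = torch.randn(8)                         # reward (or return) per step

# Option 1: log-probabilities for the whole batch in one call.
log_probs = Categorical(logits=logits).log_prob(actions)   # shape (8,)
loss_dist = (-log_probs * rewards).mean()

# Option 2: per-sample NLL on log-softmax outputs, weighted by reward.
nll = F.nll_loss(F.log_softmax(logits, dim=-1), actions, reduction="none")
loss_nll = (nll * rewards).mean()

loss_dist.backward()   # either loss can be used for the update
```

Averaging over the batch rather than summing is a choice made here; it only rescales the gradient by the batch size.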