I have a model that draws a single sample from a multinomial fairly far upstream. Other than that it’s a standard supervised learning model, producing a single loss value at the end.
I’ve got my head around the reinforce() function. I now do the following:
Compute a loss vector L over the batch (i.e. the per-example losses, not summed)
Pass the negative of this loss to reinforce() on the Variable returned by torch.multinomial().
Call backward() on the sum of the loss.
Call optimizer.step()
Doing this, the gradient with respect to the parameters of the multinomial is None. I only get a gradient if I also call backward() on the output of torch.multinomial() (after calling reinforce()). Is this the correct approach, or am I misunderstanding something?
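For reference, here is a minimal sketch of what I understand the same computation to be, written with torch.distributions.Categorical instead of reinforce() — it builds the score-function (REINFORCE) surrogate loss explicitly via log_prob, so the gradient path back to the multinomial's parameters is visible. All identifiers and the toy per-example loss are my own, not from my actual model:

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Toy stand-in for the upstream parameters of the multinomial:
# a batch of 4 examples, 3 categories.
logits = torch.randn(4, 3, requires_grad=True)

dist = Categorical(logits=logits)
actions = dist.sample()  # analogous to the torch.multinomial() sample

# Toy per-example loss that depends on the sampled actions downstream
# (stands in for the rest of the supervised pipeline).
per_example_loss = (actions.float() - 1.0) ** 2

# REINFORCE surrogate: detach the loss so gradients flow only through
# the log-probability of the sampled actions, weighted by the loss.
surrogate = (per_example_loss.detach() * dist.log_prob(actions)).sum()
surrogate.backward()

print(logits.grad is not None)  # the multinomial's parameters get a gradient
```

If I understand correctly, the reward passed to reinforce() plays the role of `-per_example_loss` here, and calling backward() on the sampled Variable is what triggers this surrogate term.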