I have a feeling that this code needs to be modified:
```python
for action, r in zip(model_Net.saved_actions, rewards):
    action.reinforce(r)  # attach each sample's reward to its stochastic action
optimizer.zero_grad()
autograd.backward(model_Net.saved_actions, [None for _ in model_Net.saved_actions])
```
My question is: how do I pass `r` and `action` in batch mode during back-propagation? It might be related to reshaping the `action` values in a way that allows back-propagation.
Right now my workaround is to compute `r` and `action` in batch mode in the forward passes, but to update the gradients sequentially (one sample at a time) in back-propagation, e.g. by running `finish_episode` several times; a sketch of this workaround is below. It's obviously not optimal, though.
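A minimal sketch of that sequential workaround, assuming the old (pre-0.4) stochastic-Variable API where `.reinforce()` exists; `optimizer`, `model_Net`, and the per-sample `finish_episode_single` helper are illustrative assumptions, not part of the original code:

```python
import torch
from torch import autograd

# Hypothetical one-sample update: each saved action came from its own
# forward pass, so each backward call touches a separate graph.
def finish_episode_single(action, r):
    action.reinforce(r)                  # attach this sample's scalar reward
    optimizer.zero_grad()
    autograd.backward([action], [None])  # REINFORCE gradient for one sample
    optimizer.step()

for action, r in zip(model_Net.saved_actions, rewards):
    finish_episode_single(action, r)
```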
If the shape of an action variable has batch dimension b, then you can call `action.reinforce` with a reward that's either a scalar or a vector of length b.
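For illustration, a minimal sketch of that batched form, again under the old (pre-0.4) stochastic-Variable API; `policy`, `states`, `optimizer`, and the tensor shapes are assumptions:

```python
import torch
from torch import autograd
from torch.autograd import Variable

# One forward pass over a batch of b states.
probs = policy(Variable(states))      # (b, num_actions) action probabilities
actions = probs.multinomial()         # stochastic Variable with batch dimension b

# Per the answer above: the reward can be a single scalar for the whole
# batch, or a FloatTensor of length b with one reward per sample.
actions.reinforce(rewards)

optimizer.zero_grad()
autograd.backward([actions], [None])  # one call for the whole batch
optimizer.step()
```

This replaces the per-sample Python loop with a single `reinforce`/`backward` pair over the whole batch.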