I am currently trying to understand the actor-critic example for the cart-pole environment.
I understand the general principle and how the algorithm works. However, in this code we have a neural net with two heads: one output for our actions (the policy) and another one for the predicted future reward of the current state (the state value).
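For context, this is roughly how I picture the two-headed model (my own sketch with made-up layer sizes and names, not the exact code from the example):

import torch.nn as nn
import torch.nn.functional as F

class TwoHeadedNet(nn.Module):
    # shared body, then one head for the policy and one for the state value
    def __init__(self, obs_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.shared = nn.Linear(obs_dim, hidden)
        self.action_head = nn.Linear(hidden, n_actions)  # policy head: action probabilities
        self.value_head = nn.Linear(hidden, 1)           # value head: predicted return of the state

    def forward(self, x):
        x = F.relu(self.shared(x))
        action_probs = F.softmax(self.action_head(x), dim=-1)
        state_value = self.value_head(x)
        return action_probs, state_value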
My question is about these few lines of code:
# reset gradients
optimizer.zero_grad()
# sum up all the values of policy_losses and value_losses
loss = torch.stack(policy_losses).sum() + torch.stack(value_losses).sum()
# perform backprop
loss.backward()
optimizer.step()
# reset rewards and action buffer
del model.rewards[:]
del model.saved_actions[:]
Here we are adding the policy loss and the value loss into a single loss. But as I understood it, we should call backward() once for the policy loss and once for the value loss. Why is it sufficient here to just add them up and call backward() a single time on the sum?
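To make the question concrete, this is the variant I had in mind (my own sketch, reusing the same policy_losses / value_losses lists as above; I believe the first backward() call would need retain_graph=True because both losses come from the same forward passes):

# my own sketch of what I expected instead: two separate backward passes,
# relying on PyTorch accumulating gradients between the two calls
optimizer.zero_grad()
policy_loss = torch.stack(policy_losses).sum()
value_loss = torch.stack(value_losses).sum()
policy_loss.backward(retain_graph=True)  # gradients from the policy loss
value_loss.backward()                    # gradients from the value loss are added on top
optimizer.step()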
Thanks in advance