Hello!

I’m trying to implement an actor-critic algorithm using PyTorch. My understanding was that it is based on two separate agents: an actor for the policy and a critic for state-value estimation, the latter being used to adjust the weights which, in plain REINFORCE, are just the discounted rewards. I recently found an implementation in which the two agents share weights, and I am somewhat lost.

Let me first introduce the context.

I previously implemented REINFORCE with baseline using the following strategy. I’m not sure it’s correct, but since the agent succeeds in various environments, I figured it had to be. Here’s what I did:

- Play a full episode and record the transitions
- Discount the rewards
- Apply the baseline: for each state of the episode, use the critic to evaluate the state and subtract this value from the corresponding discounted reward
- Update the critic: target = R_state + y * V(next_state). Minimize the loss between the target and the estimated state value
- Update the policy: use the adjusted reward vector as the weight vector of REINFORCE (where it would just be the discounted return). Hence: loss = -sum(log(selected_prob) * weights)
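The steps above, as I implemented them, look roughly like this (a sketch rather than my exact code; the function and tensor names are mine):

```python
import torch
import torch.nn.functional as F

def discount(rewards, gamma=0.99):
    # Returns-to-go: G_t = r_t + gamma * G_{t+1}
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return torch.tensor(list(reversed(out)))

def baseline_update(log_probs, rewards, values, next_values,
                    policy_optim, critic_optim, gamma=0.99):
    """One end-of-episode update.

    log_probs:   log pi(a_t | s_t) of the chosen actions (with grad)
    values:      V(s_t) from the critic (with grad)
    next_values: V(s_{t+1}) from the critic (detached; 0 for terminal)
    """
    returns = discount(rewards, gamma)

    # Baseline: subtract the critic's estimate from the discounted return
    weights = returns - values.detach()

    # Critic update: TD target R + y * V(next_state)
    target = torch.tensor(rewards) + gamma * next_values.detach()
    critic_loss = F.mse_loss(values, target)
    critic_optim.zero_grad()
    critic_loss.backward()
    critic_optim.step()

    # Policy update: loss = -sum(log(selected_prob) * weights)
    policy_loss = -(log_probs * weights).sum()
    policy_optim.zero_grad()
    policy_loss.backward()
    policy_optim.step()
```

Each network has its own optimizer here, which matches my separate-networks setup.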

REINFORCE with baseline seemed easier because you basically just replace the weights vector.

Afterwards, I expected to follow a similar strategy with actor-critic:

- The critic update has the same form
- For the actor, I’d use the TD error delta = R_state + y * V(next_state) - V(state) as the weights, as explained in Sutton’s book
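In code, I imagine the actor's weights would be computed like this (a sketch; `td_errors` is my own hypothetical helper name):

```python
import torch

def td_errors(rewards, values, next_values, gamma=0.99):
    # delta_t = R_state + y * V(next_state) - V(state), detached so the
    # actor update does not backpropagate into the critic via the weights
    rewards = torch.as_tensor(rewards)
    return (rewards + gamma * next_values - values).detach()
```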

While browsing the web for confirmation, looking for an understandable implementation on GitHub, I came across some code in which the actor and the critic share a common network and only the last layers are specific to each. Though I’m highly in favor of sharing, I don’t understand this setting or how the backprop works in this case.
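If I read that code correctly, the shared architecture is something like the following sketch (the layer sizes and names here are my own guess, not taken from the linked file):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Linear(obs_dim, hidden)          # shared layer
        self.actor_head = nn.Linear(hidden, n_actions)   # policy logits
        self.critic_head = nn.Linear(hidden, 1)          # state value

    def forward(self, x):
        h = F.relu(self.trunk(x))
        # One forward pass yields both the action distribution and V(s)
        return F.softmax(self.actor_head(h), dim=-1), self.critic_head(h)
```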

Here’s the link to my baseline implementation: https://github.com/Mehd6384/RL/blob/master/Baseline.py

Here’s the link to the puzzling (but working) implementation in question: https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py

Do the actor and the critic have to share a network? Or not? Or is it left to the user’s taste? And how can the puzzling (but working) implementation combine both losses and backpropagate?

Thanks a lot!