Actor Critic implementation problem

Hello !

I’m trying to implement an actor-critic algorithm using PyTorch. My understanding was that it is based on two separate agents, an actor for the policy and a critic for the state-value estimation, the latter being used to adjust the weights that, in REINFORCE, are just the discounted rewards. I recently found an implementation in which the two agents share weights, and I am somewhat lost.

Let me first introduce the context:

I previously implemented REINFORCE with baseline using the following strategy. I’m not sure it’s entirely correct, but since the agent is successful in various environments, I figured it had to be. Here’s what I did:

  • Play a full episode and record the transitions
  • Discount the rewards
  • Apply the baseline: for each state of the episode, use the critic to evaluate the state and subtract this value from the corresponding discounted reward
  • Update the critic: target = R_state + gamma*V(next_state); minimize the loss between the target and the estimated state value
  • Update the policy: use the adjusted reward vector as the weight vector of REINFORCE (where it would otherwise just be the discounted reward), hence loss = -sum(log(selected_prob)*weights). A sketch of these steps follows the list.
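
To make this concrete, here is a minimal sketch of those steps. The names (policy_net, value_net) and the layer sizes are placeholders for a CartPole-like setup, not my actual code:

import torch
import torch.nn as nn
import torch.nn.functional as F

gamma = 0.99
policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                           nn.Linear(32, 2), nn.Softmax(dim=-1))
value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def update(states, actions, rewards, next_states, dones):
    states = torch.stack(states)                      # (T, 4)
    next_states = torch.stack(next_states)            # (T, 4)
    actions = torch.tensor(actions)                   # (T,)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)  # 1.0 on the last step

    # Discount the rewards into returns-to-go
    returns, g = [], 0.0
    for r in reversed(rewards.tolist()):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Baseline: subtract V(state) from the discounted reward
    values = value_net(states).squeeze(-1)
    weights = returns - values.detach()

    # Critic update: target = R_state + gamma*V(next_state)
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * value_net(next_states).squeeze(-1)
    value_loss = F.mse_loss(values, targets)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Policy update: REINFORCE with the adjusted weights
    probs = policy_net(states)
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    policy_loss = -(log_probs * weights).sum()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()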

REINFORCE with baseline seemed easier because you basically just replace the weights vector.
I was therefore expecting to follow a similar strategy with actor-critic:

  • The critic update has the same form
  • For the actor, I’d use delta = R_state + gamma*V(next_state) - V(state) as the weights, as explained in Sutton’s book (sketched just below)
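
Concretely, I imagine the per-step weight being computed roughly like this (value_net is again just a placeholder name for the critic):

import torch

gamma = 0.99

def td_error(value_net, state, reward, next_state, done):
    # delta = R_state + gamma*V(next_state) - V(state); it is only a weight,
    # so no gradient needs to flow through it
    with torch.no_grad():
        bootstrap = 0.0 if done else gamma * value_net(next_state).item()
        return reward + bootstrap - value_net(state).item()

# delta would then replace the discounted return in
# loss = -sum(log(selected_prob) * weights)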

While browsing the web for confirmation and looking for an understandable implementation on GitHub, I came across some code in which the actor and the critic share a common network and only the last layers are specific to each. Though I’m all in favor of sharing, I don’t understand this setting or how backprop works in this case.

Here’s the link to my baseline implementation: https://github.com/Mehd6384/RL/blob/master/Baseline.py
Here’s the link to the puzzling (but working) implementation in question: https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py

Do the actor and the critic have to share a network, or not ? Or is it left to the user’s taste ? And how can the puzzling (but working) implementation combine both losses and backpropagate ?

Thanks a lot !

A good reason for sharing layers between the actor and the critic is that both need to understand the environment. The model basically does this…

shared_features = understanding_environment(input)
action = actor(shared_features)
value = critic(shared_features)

shared_features is used in two different submodules and backpropagation simply adds the gradients that come from each submodule.
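
In PyTorch that structure could look like the hypothetical module below (the layer sizes are placeholders, not the actual code from the linked example):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, state_dim=4, hidden=128, n_actions=2):
        super().__init__()
        self.shared = nn.Linear(state_dim, hidden)      # "understanding_environment"
        self.actor_head = nn.Linear(hidden, n_actions)  # policy head
        self.critic_head = nn.Linear(hidden, 1)         # value head

    def forward(self, x):
        shared_features = F.relu(self.shared(x))
        action_probs = F.softmax(self.actor_head(shared_features), dim=-1)
        state_value = self.critic_head(shared_features)
        return action_probs, state_value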

The puzzling code sums the losses.

loss = torch.cat(policy_losses).sum() + torch.cat(value_losses).sum()
loss.backward()

PyTorch knows how to distribute the added loss correctly to each submodule when calculating the gradients.
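
As an illustration only (not the linked implementation), here is one combined update step reusing the ActorCritic sketch above with dummy episode data; after backward(), the shared layer holds gradients contributed by both losses:

import torch

model = ActorCritic()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-2)

states = torch.randn(5, 4)      # 5 fake transitions
returns = torch.randn(5)        # fake discounted returns

probs, values = model(states)
dist = torch.distributions.Categorical(probs)
actions = dist.sample()

advantages = returns - values.squeeze(-1).detach()
policy_loss = -(dist.log_prob(actions) * advantages).sum()
value_loss = torch.nn.functional.mse_loss(values.squeeze(-1), returns)

optimizer.zero_grad()
(policy_loss + value_loss).backward()   # one backward pass for both losses
optimizer.step()

# Both heads contributed gradient to the shared layer:
print(model.shared.weight.grad.abs().sum())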

If the logic of backpropagation is fuzzy for you, I’d suggest watching a lecture by Andrej Karpathy that goes over the details.


Hey ! Thanks a lot for your answer.
I’m fine with backprop; my problem was mainly that I wasn’t sure whether the computation graph was explicit enough for PyTorch to attribute the gradients correctly, given that the two losses are mixed.

So I guess that in REINFORCE with baseline, the actor and the critic should also share some weights, shouldn’t they ?

Thanks again !

Something just occurred to me. If we want PyTorch to be able to compute the gradients, the whole computation, including action selection, has to be done within the framework. In other words, you can’t compute the policy distribution, convert it to numpy, make your stochastic pick, use it, store it and later, when the episode is over, reuse it, right ?
Otherwise the computation graph would have holes and missing elements, am I right ?

Thanks !

Exactly. If you want the gradient calculation to work, you must use PyTorch operations. If needed, you can write your own: http://pytorch.org/docs/0.3.0/notes/extending.html
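
For example, here is a minimal sketch of action selection that stays inside the graph (the tiny network is just a placeholder):

import torch
import torch.nn as nn
from torch.distributions import Categorical

policy = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))
state = torch.randn(4)

probs = policy(state)
dist = Categorical(probs)
action = dist.sample()             # the stochastic pick itself needs no gradient
log_prob = dist.log_prob(action)   # keep this tensor for the end-of-episode loss

# Later: policy_loss = -log_prob * delta; policy_loss.backward() still reaches
# `policy`. Converting probs to numpy and rebuilding the loss from that copy
# afterwards would leave exactly the kind of hole in the graph you describe.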


Well, I’d better say goodbye to convenient good ol’ functions and start getting accustomed to Tensor ops. :wink:
I’m highly grateful for your answers !
