Centralized learning-decentralized execution clarification (engineering perspective on PPO algo)

Hi everyone,

I understand the theoretical concept of the centralized learning, decentralized execution approach, but I am quite confused about the coding/engineering changes needed in the network updates of the PPO algorithm.

I think that each actor network (I have separate networks) will be updated with its own agent’s actor loss, but how are the critics updated? Should I calculate the cumulative critic loss (from all the agents) and backpropagate it through every single critic network?

Is this a multi-agent scenario? Do you want to create a multi-agent PPO framework similar to MADDPG?

Yes, that is correct. I found this paper that created a multi-agent version of the PPO algorithm with a centralized approach. The problem is that they have not published the code yet.

Well, then it’s quite simple: the critic V_i(x) of agent i collects the currently observed state of agent i plus the so-called “partial states” of the other agents j, and then predicts a value. So the structure of the training code would be: in parallel, sample an action from every agent policy P_m(x), m = 1, …, n, where n is the number of agents; exchange the state information between all agents; then, in parallel, use the critics to evaluate values; in parallel, update the critics; and finally, in parallel, update the actors using the values produced by their critics.
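A minimal single-process sketch of that loop, just to make the data flow concrete (all names here, e.g. `actors`, `critics`, `env`, are placeholders, and the actors are assumed to return `torch.distributions` objects; this is not code from the paper):

```python
import torch

def training_step(env, actors, critics, actor_opts, critic_opts, gamma=0.99):
    n = len(actors)
    states = [torch.as_tensor(s, dtype=torch.float32) for s in env.reset()]

    # decentralized acting: each agent samples from its own policy
    dists = [actors[i](states[i]) for i in range(n)]
    actions = [d.sample() for d in dists]

    next_obs, rewards, done, _ = env.step([a.numpy() for a in actions])
    next_states = [torch.as_tensor(s, dtype=torch.float32) for s in next_obs]

    # exchange state information: every critic sees the concatenation of all states
    joint = torch.cat(states)
    joint_next = torch.cat(next_states)

    for i in range(n):
        value = critics[i](joint)
        next_value = critics[i](joint_next).detach()
        # one-step TD target (episode-termination handling omitted for brevity)
        td_target = rewards[i] + gamma * next_value

        # update critic i on its own TD error
        critic_loss = (td_target - value).pow(2).mean()
        critic_opts[i].zero_grad()
        critic_loss.backward()
        critic_opts[i].step()

        # update actor i with the usual PPO clipped-surrogate loss, using the
        # advantage (td_target - value).detach() from critic i; omitted for brevity
```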

The simplest way to implement this (with medium performance) is using a thread pool, e.g. the one from joblib or the multiprocessing module, if your model is small. A more complex way would be using torch.jit.fork, which is even better than using a thread pool because it completely avoids the GIL problem (about 50% faster than a thread pool). A process pool is only recommended if your model is on a GPU and very large (like a ResNet or something of similar complexity and parameter count); there are many caveats in using a process pool and I don’t recommend it.
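For example, a rough sketch of the torch.jit.fork variant (assuming `critics` is a list of nn.Modules and `joint_state` is the concatenated state tensor; nothing else is implied about your setup):

```python
import torch

def evaluate_values_parallel(critics, joint_state):
    # fork one asynchronous task per critic; each fork returns a torch.jit.Future
    # that runs on PyTorch's inter-op thread pool
    futures = [torch.jit.fork(critic, joint_state) for critic in critics]
    # wait() gathers the results once every forked task has finished
    return [torch.jit.wait(f) for f in futures]
```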

So in general, there will be N = number_of_agents critic and actor networks. The centralized learning appears in the fact that each critic network receives state information from the other agents and is then updated with it?

So here all the agents take an action simultaneously.

Given the action each agent took in the previous step, they transition to a new state. The state of each agent will be shared with all the agents. So basically, every agent will be aware of the position of every other agent.

Here each critic is evaluating its own agent’s actions while being aware of the new dynamics of the environment (after being informed of the relative positions of all agents)?

This step is related to the previous one so the question remains the same.

I believe then that the best choice is torch.jit.fork, which, as I understand it, can run on the CPU as well.

I am really sorry for the questions, I just want to be sure I understand this correctly! :smiley:

Thank you so much!

Yes

Yes

Yes

Well, the PPO critic does not evaluate the actions of agents, just the state of the current agent and the partial states of the other agents, since the critic represents the value function V(s) and not the Q-value function Q(s, a).
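To make that concrete, a hypothetical centralized PPO critic could look like the module below: its input is only the concatenated states (own plus partial), with no actions, so it outputs V(s) rather than Q(s, a). Sizes and names are illustrative only.

```python
import torch
import torch.nn as nn

class CentralizedValueCritic(nn.Module):
    """V(s) over the concatenated states of all agents; no actions in the input."""
    def __init__(self, state_dim, n_agents, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim * n_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_state):
        # joint_state: [batch, state_dim * n_agents] -> V(s): [batch, 1]
        return self.net(joint_state)
```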

Yes

For reference, my MADDPG implementation includes the three variants of the parallel mechanisms mentioned above, if you would like to take a look:

@iffiX Thank you very much! I will take a look right away! You were very informative!

Hi @iffiX,

I used the multiprocessing module to parallelize the experiment in the multi-agent particle environment. I saw that you used the same env as well. What happens in my code is that when I call the step function (the one from the environment’s module) I get this error:

_pickle.PicklingError: Can't pickle <class '.Scenario'>: import of module '' failed

Apparently this Scenario class can’t be pickled. Are you aware of a wrapper that avoids this in some way?

I would recommend that you send a command to create the environment inside the subprocess directly, instead of pickling it; there are many problems related to pickling gym-based environments, and they are way too painful to deal with.
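For example (a sketch of the idea, not the wrapper itself): pass only the scenario name through the pipe and build the environment inside the worker, so the environment object is never pickled. The `make_env` import and the "simple_spread" scenario name below are assumptions based on the usual multiagent-particle-envs layout; adjust them to your setup.

```python
import multiprocessing as mp

def worker(remote, scenario_name):
    # the environment is constructed here, inside the subprocess, and never pickled
    from make_env import make_env            # assumed particle-env helper
    env = make_env(scenario_name)
    while True:
        cmd, data = remote.recv()
        if cmd == "reset":
            remote.send(env.reset())
        elif cmd == "step":
            remote.send(env.step(data))
        elif cmd == "close":
            remote.close()
            break

if __name__ == "__main__":
    parent, child = mp.Pipe()
    proc = mp.Process(target=worker, args=(child, "simple_spread"))
    proc.start()
    parent.send(("reset", None))
    first_obs = parent.recv()
```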

I made a wrapper for parallelizing OpenAI Gym execution in subprocesses at:


It might also work for the particle env, but I have not tested it.

Thank you @iffiX, again I am grateful for your feedback. Just one last question: how is the state of each agent shared with the critic networks of all agents?

What I do not understand is this: if we had decentralized learning, the critic network of each agent i would receive as input just one vector, the state vector of agent i. Now that the states are shared, how should I feed the state vectors of the other agents into agent i's critic network?

Should it be a list of tensors, where each element of the list is the state tensor of each agent in the environment?

Finally, I was thinking that the state of agent i should affect the policy of agent i more strongly than the states of agent j or agent z do. If this is true, will the network figure this out during training and adjust its weights so that the state vector of agent i has more influence?

Thank you in advance!

In the centralized training, decentralized acting scenario, yes, critics will observe the state vectors of other agents. This is usually done by collecting the states of all agents, concatenating them into a fixed-dimension state tensor, and feeding that into the critic network. You can of course permute the order of the states if your agents are homogeneous; this may increase network robustness.
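For example, if `states` is the list of per-agent state tensors (one `[batch, state_dim]` tensor per agent; the names here are just placeholders), the concatenation and the optional permutation could look like this:

```python
import torch

def joint_state(states):
    # states: list of [batch, state_dim] tensors -> [batch, n_agents * state_dim]
    return torch.cat(states, dim=-1)

def shuffled_joint_state(states):
    # random permutation of the agent order, only sensible for homogeneous agents
    order = torch.randperm(len(states)).tolist()
    return torch.cat([states[i] for i in order], dim=-1)
```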

It is really unknown whether your network will take agent i more seriously than the other agents. However, you may explicitly apply the attention mechanism, which is usually used in NLP, to sort of peek into the thinking of your critic network.
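A hedged sketch of what that could look like: embed each agent's state, let a query built from agent i attend over all agents, and read off the attention weights to see how much weight the critic puts on each agent. Everything below (names, sizes) is illustrative, not from a published implementation.

```python
import torch
import torch.nn as nn

class AttentionCritic(nn.Module):
    def __init__(self, state_dim, embed_dim=64):
        super().__init__()
        self.embed = nn.Linear(state_dim, embed_dim)
        self.query = nn.Linear(state_dim, embed_dim)
        self.value_head = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, states, i):
        # states: [batch, n_agents, state_dim]; i: index of the critic's own agent
        keys = self.embed(states)                              # [batch, n_agents, embed]
        q = self.query(states[:, i])                           # [batch, embed]
        scores = torch.einsum("be,bne->bn", q, keys) / keys.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)                   # inspect these weights
        context = torch.einsum("bn,bne->be", attn, keys)
        return self.value_head(context), attn
```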

Really interesting conversation.

Another topic is whether it's better to have one critic network per agent or one critic network for each cluster of homogeneous agents, since in the centralized training scenario the input vector will be the same for every critic. Some claim that this might give better performance.
However, I suppose that the attention mechanism can only be applied with multiple critic networks.

You can of course permute the order of the states if your agents are homogeneous; this may increase network robustness.

What I did was concatenate the states and feed them to every critic network in the same order. I could instead try to always feed the critic of agent i with the state of agent i at the start of the concatenated states.
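That ordering is easy to implement: rotate the list of per-agent states so that agent i's own state comes first before concatenating (a small sketch, reusing the `states` placeholder from above):

```python
import torch

def joint_state_for_agent(states, i):
    # agent i's state first, the others keep their relative order
    ordered = states[i:] + states[:i]
    return torch.cat(ordered, dim=-1)
```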

Thank you for your detailed answer @iffiX !

I have tried:

  1. permute the order randomly
  2. put agent i first
  3. order the states from 1 to n

In my undergraduate thesis, there was no significant difference between these orderings.

A visualization of the attention weights, using method 3:

For single/multiple critics in the homogeneous scenario:
I have tried:

  1. one central critic
  2. one critic for every agent
  3. one critic for every agent, but averaging their parameters every few episodes (by taking the mean of their state dicts and broadcasting the new mean parameters to all critics; see the sketch below)

Option 1 performs best and option 3 performs the worst; I assume that the averaging damaged the information collected by the critic.
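A minimal sketch of that averaging step, assuming all critics share the same architecture (the names are placeholders):

```python
import torch

def average_critics(critics):
    # element-wise mean of the critics' parameters, broadcast back to every critic
    state_dicts = [c.state_dict() for c in critics]
    mean_state = {k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0)
                  for k in state_dicts[0]}
    for c in critics:
        c.load_state_dict(mean_state)
```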

You may try these methods yourself, as I have only tested my solution with one multi-agent environment modified from BipedalWalker-v2; contact me if you would like to take a look at the modified implementation:


@iffiX impressive experimentation, congratulations!!

I will, thank you very much for the information! These are very interesting topics.