Multi-agent deep reinforcement learning for an environment with a discrete action space

ivallesp · December 27, 2018, 11:32am

Hi, I have been doing the Udacity deep-reinforcement-learning nanodegree and a question came up. Do you know of any cutting-edge deep reinforcement learning algorithm that can be successfully applied to discrete action spaces in multi-agent settings?

I have been researching and found that MADDPG and Soft Q-learning are among the top state-of-the-art algorithms. I implemented the first one on a Unity environment and it works well! However, both are mainly focused on environments with continuous action spaces. Although they can be applied to discrete action spaces (e.g. MADDPG with Gumbel-softmax), that does not seem to be what they are intended for (I tried MADDPG with Gumbel-softmax and got disastrous results…). Their papers don’t give many details on how to use them in these settings.

Can somebody help me with this?
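
For readers unfamiliar with the Gumbel-softmax trick mentioned above, here is a minimal sketch of the sampling step in plain Python. It is not the MADDPG implementation from the post; the function name `gumbel_softmax_sample` and the temperature parameter `tau` are illustrative assumptions. In an autodiff framework the relaxed sample is what lets gradients flow from the critic back into a discrete-action actor.

```python
import math
import random


def gumbel_softmax_sample(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample from a categorical distribution.

    logits: unnormalized log-probabilities, one per discrete action.
    tau: temperature (assumed name); smaller values give samples closer to one-hot.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 3-action agent.
sample = gumbel_softmax_sample([100.0, 0.0, 0.0], tau=1.0)
action = max(range(len(sample)), key=sample.__getitem__)  # hard action to send to the environment
```

The soft sample is used for the gradient path, while its argmax is the action actually executed. How well this works in practice depends heavily on the temperature, which may be one reason results vary.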

Beatrice_Paige · December 27, 2018, 12:05pm

There’s quite a bit if you do a regular Google search. Here’s a link.

alexis-jacq · January 5, 2019, 5:52pm

Concerning the soft Q-learning approach, the adaptation to discrete worlds looks simple:

In the critic update, use
Q(a,s) = r(a,s) + sum_s’ ( T(s’|a,s) * V(s’) )
V(s) = alpha * log( sum_a exp( Q(a,s) / alpha ) )

and compute the new policy directly for all agents:
pi(a|s) = softmax_a( Q(a,s) / alpha )
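
The update above can be sketched in tabular form as follows. This is only an illustration under stated assumptions: the soft value uses the usual alpha-scaled log-sum-exp, a discount factor `gamma` is added so the iteration converges (the formulas above correspond to `gamma = 1`), and the two-state MDP with its rewards `R` and transitions `T` is made up for the example.

```python
import math


def soft_value(q_row, alpha):
    """V(s) = alpha * log(sum_a exp(Q(a,s) / alpha)), computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))


def soft_policy(q_row, alpha):
    """pi(a|s) = softmax over actions of Q(a,s) / alpha."""
    m = max(q_row)
    weights = [math.exp((q - m) / alpha) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


def soft_q_backup(R, T, V, gamma=1.0):
    """Q(a,s) = r(a,s) + gamma * sum_s' T(s'|a,s) * V(s').

    R[s][a] is the reward and T[s][a][s2] the transition probability.
    gamma is an added assumption; gamma = 1 matches the formula in the post.
    """
    n_states = len(V)
    return [
        [R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_states))
         for a in range(len(R[s]))]
        for s in range(len(R))
    ]


# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
R = [[0.0, 1.0], [0.5, 0.0]]
T = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
alpha, gamma = 0.5, 0.9

V = [0.0, 0.0]
for _ in range(200):  # repeated soft Bellman backups
    Q = soft_q_backup(R, T, V, gamma)
    V = [soft_value(q_row, alpha) for q_row in Q]

policy = [soft_policy(q_row, alpha) for q_row in Q]
```

With a known model the policy falls out of Q in closed form, so no separate actor update is needed. In the deep, model-free case the expectation over `T` would be replaced by sampled transitions.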

pszabo · January 7, 2019, 12:13pm

Hi,
Q-learning was originally developed for Markov decision processes with discrete action spaces. A fine example:
https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

ivallesp · January 22, 2019, 12:57pm

But this is not multi-agent…

ivallesp · January 22, 2019, 12:58pm

Same here, this is not multi-agent.

alexis-jacq · January 22, 2019, 5:18pm

The paper you mention about multi-agent soft Q-learning is a centralized approach, where all agents share a common critic with a joint policy (one network that outputs one action per agent). My answer focused on that case.
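
To make the centralized case concrete, here is a rough sketch of what a joint policy over discrete actions could look like. It assumes the shared critic assigns one value per joint action in a given state; the helper names and the example Q-values are hypothetical and not taken from the paper.

```python
import itertools
import math


def enumerate_joint_actions(n_actions_per_agent):
    """All joint actions as tuples, one entry per agent."""
    return list(itertools.product(*(range(n) for n in n_actions_per_agent)))


def joint_soft_policy(q_joint, alpha):
    """Softmax over joint actions: probability proportional to exp(Q / alpha)."""
    m = max(q_joint.values())
    weights = {a: math.exp((q - m) / alpha) for a, q in q_joint.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def agent_marginal(pi_joint, agent, n_actions):
    """Marginal action distribution of one agent under the joint policy."""
    marginal = [0.0] * n_actions
    for joint_action, p in pi_joint.items():
        marginal[joint_action[agent]] += p
    return marginal


# Hypothetical example: two agents with 2 and 3 actions, shared critic values for one state.
sizes = [2, 3]
actions = enumerate_joint_actions(sizes)
# Made-up critic output: only the coordinated joint action (1, 2) has value 1.
q_joint = {a: 1.0 if a == (1, 2) else 0.0 for a in actions}
pi = joint_soft_policy(q_joint, alpha=0.5)
best_joint_action = max(pi, key=pi.get)
agent0_marginal = agent_marginal(pi, agent=0, n_actions=sizes[0])
```

Note that the number of joint actions is the product of the per-agent action counts, so enumerating them like this only scales to a few agents with small action sets.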