Multiplayer weight sharing between identical networks

Hey there,

I implemented a four-player card game using PyTorch and reinforcement learning (PPO). To train the agents I make four exact copies and let them play against each other. I would now like to share the weights between these identical networks after a certain update interval.

I found this procedure:

  1. Make all your modules

  2. Make all your clones

  3. Add all the modules and clones to a single nn.Container

  4. Call :getParameters on the nn.Container to get params and grads. This will preserve any sharing of parameters between modules inside the nn.Container.

  5. Now using the modules and clones as normal will play nice with optim because all of the params and grads reference the same storage as the tensors from :getParameters.

I tried to implement it as follows (ppo holds the four models):

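            container = nn.Container()
            for i in range(4):
                # assumption: register each player's policy in the container
                container.add_module("ppo_" + str(i), ppo[i].policy)
            params = container.parameters()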

How do I now apply the parameters back to each model?
Is the above method the correct approach?
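
The recipe quoted above is written for Lua Torch (`:getParameters`). In PyTorch, one way to apply weights back to a model is `load_state_dict`. A minimal sketch, assuming the four players are plain `nn.Module` policies; the `TinyPolicy` class and the `sync_from` helper are hypothetical stand-ins and not part of the original code. It copies one player's weights into the others, which could be done after some update interval:

```python
import torch
import torch.nn as nn


class TinyPolicy(nn.Module):
    """Hypothetical stand-in for the ActorCritic policy of one player."""

    def __init__(self, state_dim=8, action_dim=4):
        super().__init__()
        self.net = nn.Linear(state_dim, action_dim)

    def forward(self, x):
        return self.net(x).softmax(dim=-1)


def sync_from(source, targets):
    """Copy all parameters and buffers of `source` into every target model."""
    state = source.state_dict()
    for target in targets:
        # load_state_dict copies the values; the models keep separate storage
        target.load_state_dict(state)


torch.manual_seed(0)
policies = [TinyPolicy() for _ in range(4)]

# e.g. after every N updates, push player 0's weights to players 1..3
sync_from(policies[0], policies[1:])
```

Which player acts as the source is a design choice this sketch leaves open. The copy does not touch any optimizer state, so each player's optimizer would continue with its own statistics.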

Further Snippets:

import torch
import torch.nn as nn

class PPO:
    def __init__(self, state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip, lr_decay=1000000):
        self.lr = lr
        self.betas = betas
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.K_epochs = K_epochs

        self.policy = ActorCritic(state_dim, action_dim, n_latent_var)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas, eps=1e-5) # no eps before!
        self.policy_old = ActorCritic(state_dim, action_dim, n_latent_var)
        #TO decay learning rate during training:
        self.scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer, step_size=lr_decay, gamma=0.9)
        self.MseLoss = nn.MSELoss()

class ActorMod(nn.Module):
    def __init__(self, state_dim, action_dim, n_latent_var):
        super(ActorMod, self).__init__()
        self.l1      = nn.Linear(state_dim, n_latent_var)
        self.l1_tanh = nn.PReLU()
        self.l2      = nn.Linear(n_latent_var, n_latent_var)
        self.l2_tanh = nn.PReLU()
        self.l3      = nn.Linear(n_latent_var+60, action_dim)

    def forward(self, input):
        x = self.l1(input)
        x = self.l1_tanh(x)
        x = self.l2(x)
        out1 = self.l2_tanh(x) # 64x1
        if len(input.shape)==1:
            out2 = input[180:240]   # 60x1 these are the available options of the active player!
            output = torch.cat([out1, out2], 0)
        else:
            out2 = input[:, 180:240]
            output = torch.cat([out1, out2], 1) #how to do that?
        x = self.l3(output)
        return x.softmax(dim=-1)

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_latent_var):
        super(ActorCritic, self).__init__()

        # actor
        #TODO see question:
        self.action_layer = ActorMod(state_dim, action_dim, n_latent_var)

        # critic
        self.value_layer = nn.Sequential(
                nn.Linear(state_dim, n_latent_var),
                nn.Linear(n_latent_var, n_latent_var),
                nn.Linear(n_latent_var, 1)
                )
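
Regarding the `#how to do that?` comment in `ActorMod.forward`: one option that seems to cover both the single-observation and the batched case is to slice and concatenate along the last dimension. A small sketch, assuming (as in the snippet) that entries 180 to 239 of the observation hold the available options; the function name `concat_options` and the observation size of 240 are only illustrative:

```python
import torch


def concat_options(hidden, obs, start=180, stop=240):
    """Append obs[..., start:stop] to hidden along the last dimension.

    Works for a single observation (1-D) and for a batch (2-D), because
    `...` keeps all leading dimensions and dim=-1 is always the feature axis.
    """
    options = obs[..., start:stop]
    return torch.cat([hidden, options], dim=-1)


# single observation: hidden is (64,), obs is (240,)
single = concat_options(torch.zeros(64), torch.ones(240))

# batch of 5 observations: hidden is (5, 64), obs is (5, 240)
batch = concat_options(torch.zeros(5, 64), torch.ones(5, 240))
```

With this, the `if`/`else` on `len(input.shape)` would no longer be needed, since both cases go through the same two lines.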

OK, here is what I tried next:

I have four identical models that play against each other and are trained using PPO.

I now tried to calculate the mean of the parameters and update the models like this:

            final_dict = {}
            for [dict1, dict2, dict3, dict4] in zip(ppo[0].policy.state_dict().items(), ppo[1].policy.state_dict().items(), ppo[2].policy.state_dict().items(), ppo[3].policy.state_dict().items()):
                val1, val2, val3, val4 = dict1[1], dict2[1], dict3[1], dict4[1]
                final_dict[dict1[0]] = (val1+val2+val3+val4)/4
            for i in range(4):
                # assumption: load the averaged weights back into each model
                ppo[i].policy.load_state_dict(final_dict)

However, this is not working correctly (it does not learn any faster).

Any ideas?
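
A self-contained sketch of the averaging step, in case it helps to check the mechanics in isolation. It assumes all four models have the same architecture and only floating-point entries in their `state_dict`; `average_state_dicts` and the small `nn.Linear` stand-ins are hypothetical names, not taken from the code above:

```python
import torch
import torch.nn as nn


def average_state_dicts(models):
    """Return a state_dict whose entries are the element-wise mean over models."""
    state_dicts = [m.state_dict() for m in models]
    averaged = {}
    for key in state_dicts[0]:
        # stack the same tensor from every model and average over the new axis
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        averaged[key] = stacked.mean(dim=0)
    return averaged


torch.manual_seed(0)
models = [nn.Linear(3, 2) for _ in range(4)]  # stand-ins for ppo[i].policy
expected_weight = sum(m.weight.detach().clone() for m in models) / 4

averaged = average_state_dicts(models)
for m in models:
    m.load_state_dict(averaged)
```

This only averages the network weights. Each player's Adam optimizer keeps its own moment estimates, and `policy_old` is not updated here, so those may need separate handling.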

Hi Markus,
I’m also busy with a 4 player card game and I’m trying to learn from your code. Thanks for sharing!
With respect to the neural network I’m following a different approach: as my network is memoryless (everything comes in via observations) and the game itself is turn-based, I use the very same network for all 4 players. So the network learns from four times as many samples per game and there is no need to share the weights.
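
A rough sketch of that idea, assuming a turn-based loop in which the same policy object picks the action for whichever player is on turn. The `SharedPolicy` class, the random observations, and the per-player buffers are hypothetical placeholders, not code from either project:

```python
import torch
import torch.nn as nn


class SharedPolicy(nn.Module):
    """Hypothetical memoryless policy: observation in, action probabilities out."""

    def __init__(self, state_dim=16, action_dim=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32),
            nn.Tanh(),
            nn.Linear(32, action_dim),
        )

    def forward(self, obs):
        return self.net(obs).softmax(dim=-1)


torch.manual_seed(0)
policy = SharedPolicy()                        # one network for all four players
buffers = {player: [] for player in range(4)}  # separate experience per seat

for turn in range(8):
    player = turn % 4                          # turn-based: one player acts at a time
    obs = torch.randn(16)                      # placeholder for that player's observation
    with torch.no_grad():
        probs = policy(obs)
    action = torch.multinomial(probs, num_samples=1).item()
    buffers[player].append((obs, action))

# all seats contribute samples to the update of the same parameters
all_samples = [sample for seat in buffers.values() for sample in seat]
```

Because there is only one set of parameters, no copying or averaging between players is involved.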

Hey Chris,

Sounds nice - what game are you referring to? What's the name of the game?

The game is called ‘Rikken’. It’s played in parts of the Netherlands and Belgium. It bears some resemblance to Bridge.

You have a bidding phase and a playing phase. There are several game types (numbers of tricks to win, trump, teams 1 vs 3 or 2 vs 2).

So all in all, it’s quite a challenge.

Right now I’m trying to understand and learn from your code at mcts_cardgame/ at master · CesMak/mcts_cardgame · GitHub