I need to make my PPO implementation reproducible, but in a multi-agent environment. When I applied the same algorithm in a single-agent setting, reproducibility was successful.
I suspect the problem lies in the initialization of the actor and critic networks.
I create the critic networks with the snippet below (the actor networks are built with similar code):
# one critic per agent; re-seeding before each construction
# gives every critic identical initial weights
for _ in range(len(self.env.agents)):
    torch.manual_seed(self.seed)
    critic = Critic(sum(obs_size), self.hidden_size).to(self.device)
    self.critics.append(critic)
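To make the snippet above easier to follow, here is a simplified sketch of what my Critic looks like; the layer layout is illustrative, only the constructor signature matches the call above:

import torch
import torch.nn as nn

class Critic(nn.Module):
    # illustrative sketch: a small MLP mapping the joint observation
    # to a single value estimate
    def __init__(self, obs_size: int, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)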
The environment is also initialized with a seed, as shown below:
def init_env(gym_id: str, seed: int):
    # create the multi-agent environment and seed both the
    # environment and NumPy's global random state
    env = make_env(gym_id, discrete_action=True)
    env.seed(seed)
    np.random.seed(seed)
    return env, seed
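I call it once at startup, roughly like this (the gym_id here is just an example):

env, seed = init_env("simple_spread", seed=42)
obs = env.reset()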
Additionally, the buffer from which the agents update their policy is shuffled with a fixed seed, as shown below:
ids = np.arange(self.trajectory_size)
for agent, _ in enumerate(self.env.agents):
    for epoch in range(self.epochs):
        # re-seeding before every shuffle applies the same
        # permutation each time, so the ordering is deterministic
        np.random.seed(self.seed)
        np.random.shuffle(ids)
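The shuffled ids are then used to slice minibatches out of the trajectory buffer, roughly like this (the minibatch size and buffer names here are illustrative, not my exact code):

minibatch_size = 64  # illustrative
for start in range(0, self.trajectory_size, minibatch_size):
    batch_ids = ids[start:start + minibatch_size]
    # gather this agent's stored rollout data for the minibatch
    obs_b = self.obs_buffer[agent][batch_ids]
    actions_b = self.actions_buffer[agent][batch_ids]
    # ... compute the clipped PPO loss on (obs_b, actions_b) ...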
Is there anything else in the PPO algorithm that is sensitive to randomness (and therefore needs to be seeded) that I am missing?