I need to create reproducible code for the PPO algorithm, but in a multi-agent environment. When I applied the same algorithm in a single-agent setting, reproducibility was successful.
I think the problem lies in the initialization of the actor and critic networks.
I create the critic networks using the code snippet below (a similar snippet is used for the actor networks):
    for _ in range(len(self.env.agents)):
        torch.manual_seed(self.seed)
        critic = Critic(sum(obs_size), self.hidden_size).to(self.device)
        self.critics.append(critic)
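To double-check that this part is deterministic, I verified that seeding immediately before construction pins down the initial weights. A minimal, self-contained sketch (the `Critic` here is a toy stand-in with hypothetical sizes, not my real network):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    # Toy stand-in for the real Critic; layer sizes are hypothetical.
    def __init__(self, obs_size: int, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

def make_critic(seed: int) -> Critic:
    # Seed right before construction, as in my training code.
    torch.manual_seed(seed)
    return Critic(8, 32)

a, b = make_critic(42), make_critic(42)
same = all(torch.equal(p, q) for p, q in zip(a.parameters(), b.parameters()))
print(same)  # True: two constructions under the same seed share identical weights
```

Note that because the loop re-seeds with the same value on every iteration, all critics in the list start from identical weights.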
The environment is also initialized with a seed, as shown below:
    def init_env(gym_id: str, seed: int):
        env = make_env(gym_id, discrete_action=True)
        env.seed(seed)
        np.random.seed(seed)
        return env, seed
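Beyond the environment and NumPy, I also seed the other RNGs the training loop may touch at startup. A sketch of the helper I use (`seed_everything` is my own name for it; the CUDA/cuDNN lines are only relevant when running on GPU):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # Hypothetical helper: seed every RNG the training loop may touch.
    random.seed(seed)                  # Python's built-in RNG
    np.random.seed(seed)               # NumPy (buffer shuffling, env internals)
    torch.manual_seed(seed)            # CPU tensors and network initialization
    torch.cuda.manual_seed_all(seed)   # all CUDA devices (no-op without GPU)
    torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable the cuDNN autotuner

# Sanity check: the same seed reproduces the same random draws.
seed_everything(0)
a = torch.randn(3)
seed_everything(0)
b = torch.randn(3)
print(torch.equal(a, b))  # True
```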
Additionally, the buffer from which the agents update their policies is shuffled with a fixed seed, as shown below:
    ids = np.arange(self.trajectory_size)
    for agent, _ in enumerate(self.env.agents):
        for epoch in range(self.epochs):
            np.random.seed(self.seed)
            np.random.shuffle(ids)
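One thing I noticed while writing this up: re-seeding with the same value inside the loop produces the same permutation on every epoch. A variant I considered (my own sketch, not the code above) uses a single seeded generator so the shuffle stays reproducible across runs but still differs between epochs:

```python
import numpy as np

trajectory_size, epochs, seed = 8, 3, 0

def shuffled_ids_per_epoch(seed: int) -> list:
    # One seeded Generator for the whole update keeps runs reproducible
    # while letting the permutation vary from epoch to epoch.
    rng = np.random.default_rng(seed)
    perms = []
    for epoch in range(epochs):
        ids = np.arange(trajectory_size)
        rng.shuffle(ids)
        perms.append(ids.copy())
    return perms

perms = shuffled_ids_per_epoch(seed)
perms2 = shuffled_ids_per_epoch(seed)  # re-run with the same seed
print(all((a == b).all() for a, b in zip(perms, perms2)))  # True: identical across runs
```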
Is there anything else in the PPO algorithm that is sensitive to randomness (and needs to be seeded) that I might be missing?