Selecting actions for N agents on a single GPU with torch.distributed.rpc


I want to modify this code (examples/ at master · pytorch/examples · GitHub), which has 1 Agent and N observers that interact with an environment at the same time through torch.distributed.rpc.

My goal is to have N agents and 1 Simulator, where the Simulator asks the agents to sample actions and triggers updates when required.

For example, to select actions:

def select_actions_all_agents(self, state):
    # reset the action buffer to a sentinel value
    self.current_actions = np.full(self.current_actions.shape, -1000, dtype=np.int32)
    futs = []
    start_time = time.time()
    for ag_rref in self.ag_rrefs:
        # make async RPCs to kick off action selection on all agents
        futs.append(
            rpc.rpc_async(
                ag_rref.owner(),
                _call_method,
                args=(Agent.select_action, ag_rref, self.sim_rref, state),
            )
        )
    # wait until all agents have finished selecting an action
    for fut in futs:
        fut.wait()
    self.time_select_action += time.time() - start_time
    self.num_time_select_action += 1

However, it seems that it does not reduce the inference time.

When instantiating each agent, I place it on the same GPU.

class Agent:
    def __init__(self):
        self.id = rpc.get_worker_info().id
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.policy = Policy().to(self.device)
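The post never shows `Agent.select_action`; for completeness, here is a minimal, self-contained sketch of what it might look like. The `Policy` architecture, state dimension, and sampling scheme below are my assumptions, not taken from the original example:

```python
import torch
import torch.nn as nn


class Policy(nn.Module):
    # Stand-in policy network; the real Policy and its
    # dimensions are assumptions, not from the original code.
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Linear(state_dim, n_actions)

    def forward(self, x):
        return self.net(x)


class Agent:
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.policy = Policy().to(self.device)

    def select_action(self, state):
        # Move the state onto the agent's device, run one forward
        # pass, and sample an action from the resulting distribution.
        state = torch.as_tensor(state, dtype=torch.float32,
                                device=self.device).unsqueeze(0)
        with torch.no_grad():
            probs = torch.softmax(self.policy(state), dim=-1)
        return torch.multinomial(probs, num_samples=1).item()
```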

I would expect each agent to operate in parallel on the GPU, greatly reducing the inference time.

Any ideas?

Is this the same question as this post?