Selecting actions of N agents inside a single GPU with torch.distributed.rpc

Hello,

I want to modify this example: examples/main.py at master · pytorch/examples · GitHub, which has 1 agent and N observers that interact with an environment concurrently through torch.distributed.rpc.

My goal is to have N agents and 1 simulator, where the simulator asks the agents to sample actions and to update when required.
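Roughly, the setup mirrors the original example with the roles swapped. A sketch (the worker names and ranks are just placeholders, and Agent is the class shown further down):

import torch.distributed.rpc as rpc
from torch.distributed.rpc import RRef, remote

AGENT_NAME = "agent{}"        # placeholder worker names
SIMULATOR_NAME = "simulator"  # placeholder

class Simulator:
    def __init__(self, world_size):
        self.sim_rref = RRef(self)
        self.ag_rrefs = []
        # create one Agent on each agent worker; they all end up on the same GPU
        for rank in range(1, world_size):
            ag_info = rpc.get_worker_info(AGENT_NAME.format(rank))
            self.ag_rrefs.append(remote(ag_info, Agent))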

For example, to select actions:

def select_actions_all_agents(self, state):
    # sentinel value so missing actions are easy to spot
    self.current_actions = np.full(self.current_actions.shape, -1000, dtype=np.int32)
    futs = []
    start_time = time.time()
    for ag_rreff in self.ag_rrefs:
        # make async RPC to ask each agent to select an action
        futs.append(
            rpc_async(
                ag_rreff.owner(),
                _call_method,
                args=(Agent.select_action, ag_rreff, self.sim_rref, state)
            )
        )
    # wait until all agents have finished selecting their actions
    for fut in futs:
        fut.wait()

    self.time_select_action += (time.time() - start_time)
    self.num_time_select_action += 1
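(_call_method here is the same helper as in the original example, i.e. something along these lines:)

def _call_method(method, rref, *args, **kwargs):
    # call the given method on the object held by the RRef, on the RRef's owner
    return method(rref.local_value(), *args, **kwargs)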

However, this does not seem to reduce the inference time.

When instantiating each agent, I place its policy on the same GPU.

class Agent:
    def __init__(self):
        self.id = rpc.get_worker_info().id
        # all agents share the same GPU when one is available
        self.device = ("cuda" if torch.cuda.is_available() else "cpu")
        # different seed per agent so the policies are not initialized identically
        torch.manual_seed(args.seed + self.id)
        self.policy = Policy()
        self.policy.to(self.device)
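Each agent's select_action then moves the state to its device and runs the policy forward pass, roughly like this (a sketch; report_action is a hypothetical callback on the simulator that fills current_actions):

from torch.distributions import Categorical

class Agent:
    # ... __init__ as above ...

    def select_action(self, sim_rref, state):
        # move the state onto the agent's GPU and run the policy
        state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        probs = self.policy(state)
        action = Categorical(probs).sample()
        # hypothetical callback: report the chosen action back to the simulator
        sim_rref.rpc_sync().report_action(self.id, action.item())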

I would expect each agent to run its inference in parallel on the GPU, greatly reducing the overall inference time.

Any ideas?

Is this the same question as this post?