PyTorch RPC multiple threading training

LW-Ricarido · September 24, 2020, 6:06am

Hello!
I try to build a distributed RL with PyTorch RPC, but I got a problem that I can’t use my GPU power fully. Here is my code structure. They are two parts, learner and actor.

class Learner:
    
    def __init__(self):
        self.policy = Policy()
        self.buffer = container()
        run_multiple_actor()
    
    def learning_loop(self):
        "backward & optimize"
        while True:
            get_trans_from_buffer()
            computing_loss()
            backward()
            optimizer.step()
    
    def get_action(self, state):
        action = self.policy(state)
        insert_to_buffer(state,action)
        return action

class Actor:
    
    def __init__(self):
        self.env = env
        self.learner_rref = rref
    
    def act_loop(self):
        while True:
            state = self.env.step(action)
            action = self.learner_rref.run_sync().get_action(state)

For learner, after initating, it runs in learning_loop. For actor, after initating, it runs in act_loop and call Learner’s get_action from remote.

The question is that could get_action threads run simultaneously on GPU? If they can, tt seems that as long as I run enough actors, I fully use my GPU power. However, after adding serveal actors, the volatile of GPU stop inscreasing and stays in a low level(eg. 20% or 30%). And I don’t think it’s a problem of my CPU cores. I have enough CPU to make all actors run simultaneously.

Could anyone point out what’s the problem of my code. I am new to PyTorch PRC, help me please.

mrshenli · October 2, 2020, 3:48pm

Hey @LW-Ricarido, sorry about the delay.

If you use regular Python functions as RPC target, multiple requests cannot run in parallel on the callee side due to Python Global Interpreter Lock (GIL). To avoid the lock, you can convert your function into a TorchScript function by adding a @torch.jit.script decorator. Some examples are available here, please search for @torch.jit.script.

This is a more thorough doc for TorchScript.
Example tests can be found here.

LW-Ricarido · October 9, 2020, 11:29am

Hey @mrshenli, thanks a lot. But I find some problems when I use @torch.jit.script decorator. When I use torch.jit.script , the complier cannot infer self.policy correctly, which is a nn.Module and an attribute of Learner. Should change class Learner and Actor to torch.jit.ScriptModule. And if I want to avoid GIL, should both callee function and caller be wrapped by @torch.jit.script?

mrshenli · October 9, 2020, 2:21pm

Hey @LW-Ricarido

And if I want to avoid GIL, should both callee function and caller be wrapped by @torch.jit.script ?

Yep, current RPC system requires caller and callee functions to be the same. You can, for example, define those functions in a utility file and let both caller and callee import that file.