RuntimeError: Stop_waiting response is expected. RPC

Worker 1:
import torch
import torch.distributed.rpc as rpc
rpc.init_rpc("worker0", rank=0, world_size=2,
             rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                 init_method="tcp://0.tcp.ngrok.io:18324"))
while True:
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3), timeout=0)
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    x = rref1.to_here() + rref2.to_here()

rpc.shutdown()

Worker 2:
import torch.distributed.rpc as rpc
rpc.init_rpc("worker1", rank=1, world_size=2,
             rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                 init_method="tcp://0.tcp.ngrok.io:18324"))
rpc.shutdown()

I have been trying to run these two scripts on two different systems (completely separate machines, connected only over the internet).
I just don’t understand what I am doing wrong, as I keep getting this error:

RuntimeError: Stop_waiting response is expected

Also I have a few questions regarding RPC:

  1. Can I have a system where I send every single model block (up to 96 transformer blocks) to a different GPU across multiple cloud services? Is it going to be slow?
  2. If I can do the above, how should I even begin?

I haven’t tried to repro this, but my intuition for the error (RuntimeError: Stop_waiting response is expected) is that “worker1” has finished its execution and called shutdown(), so “worker0” can no longer communicate with it. You will need to make “worker1” long-running.
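For reference, here is a minimal sketch of the long-running pattern on the “worker1” side, assuming the same ngrok address as in the post. The default graceful rpc.shutdown() blocks until every worker in the group has also called shutdown, so the process stays up and keeps serving rpc.remote() calls from “worker0” in the meantime:

import torch.distributed.rpc as rpc

rpc.init_rpc("worker1", rank=1, world_size=2,
             rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                 init_method="tcp://0.tcp.ngrok.io:18324"))

# graceful=True (the default) blocks here until all workers reach shutdown,
# so "worker1" keeps handling remote requests instead of exiting right away.
rpc.shutdown(graceful=True)

Note also that with the infinite while True loop, “worker0” never reaches its own rpc.shutdown(), so that loop needs a termination condition as well.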

  1. Can I have a system where I send every single model block (up to 96 transformer blocks) to a different GPU across multiple cloud services? Is it going to be slow?

Is there a reason you need to send the entire model? Is this for a specific type of architecture, like a parameter server?
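As a rough, untested sketch of one way to begin, assuming you do want to place individual blocks on remote workers: a block can be constructed on a remote worker with rpc.remote, which returns an RRef owned by that worker, and then invoked through the RRef’s rpc_sync() proxy. The Block module, worker names, and tensor shapes below are placeholders, not something from the original post:

import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

class Block(nn.Module):
    # Stand-in for a single transformer block. Both processes must be able
    # to resolve this class (e.g. run the same script or import it).
    def __init__(self, dim=512):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8)

    def forward(self, x):
        return self.layer(x)

# On "worker0", after init_rpc: build the block on "worker1" and keep an RRef to it.
block_rref = rpc.remote("worker1", Block, args=(512,))

# Run a forward pass on the owning worker and fetch the result back.
x = torch.randn(10, 2, 512)           # (seq_len, batch, dim)
y = block_rref.rpc_sync().forward(x)  # blocking remote method call

The rpc_async() and remote() variants of the same proxy exist if the per-block calls need to overlap instead of blocking one at a time.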