Worker 1:
import torch
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2,
    rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
        init_method='tcp://0.tcp.ngrok.io:18324'))

while True:
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3), timeout=0)
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    x = rref1.to_here() + rref2.to_here()

rpc.shutdown()
Worker 2:
import torch.distributed.rpc as rpc

rpc.init_rpc("worker1", rank=1, world_size=2,
    rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
        init_method='tcp://0.tcp.ngrok.io:18324'))

rpc.shutdown()
I have been trying to run these two scripts on two different systems (completely separate machines, connected only over the internet).
I just don't understand what I am doing wrong, as I keep getting this error:
RuntimeError: Stop_waiting response is expected
Also I have a few questions regarding RPC:
- Can I build a system where I send each model block (up to 96 Transformer blocks) to a different GPU across multiple cloud services? Is it going to be slow?
- If that is possible, how should I even begin? (A rough sketch of what I have in mind follows below.)
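To make the question concrete, here is a minimal sketch of the kind of setup I imagine, assuming one driver process plus three GPU workers and the 96 blocks grouped into shards of 32. BlockShard, the worker names, and all the sizes are placeholders I made up for illustration, not tested code:

import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

class BlockShard(nn.Module):
    # One contiguous slice of the transformer, living on one worker's GPU.
    # (In a real run this class must be importable on every worker, since
    # RPC pickles the callable by qualified name.)
    def __init__(self, num_blocks, d_model=512, nhead=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_blocks).cuda()

    def forward(self, x):
        # Activations arrive over RPC on CPU; run this slice on the GPU,
        # then move the result back to CPU so it can be shipped onward.
        return self.blocks(x.cuda()).cpu()

def run_driver():
    # Instantiate one 32-block shard on each of three remote workers.
    shards = [rpc.remote(f"worker{i}", BlockShard, args=(32,)) for i in (1, 2, 3)]
    x = torch.randn(10, 4, 512)  # (seq_len, batch, d_model)
    for shard in shards:
        # Each hop sends the activation tensor to the next worker and
        # runs that worker's slice of the model.
        x = shard.rpc_sync().forward(x)
    print(x.shape)

if __name__ == "__main__":
    # Driver is rank 0; workers 1-3 would call init_rpc with their own
    # names and then block in rpc.shutdown() until the driver is done.
    rpc.init_rpc("driver", rank=0, world_size=4,
                 rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                     init_method='tcp://<rendezvous-host>:<port>'))
    run_driver()
    rpc.shutdown()

My worry is that every hop in that loop serializes the full activation tensor and sends it over the network, so across internet links the transfer latency would presumably dominate the compute. That is really the heart of the "is it going to be slow" question.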