I am running a FedAvg simulation using PyTorch RPC, but when I run it, the server side throws errors. It seems to be a problem in my code, but I can't tell what it is. Here are the related code snippets:
.
.
.
# Start training
if args.rank == 0:
    for e in range(args.epoch):
        processes = []
        q = mp.Queue()
        print("Server's Epoch:"+str(e+1))
        weight = copy.deepcopy(model.state_dict())
        for r in range(args.world_size):
            p = mp.Process(
                target=run_worker,
                args=(
                    r,
                    model,
                    args.lr,
                    train_loader[r],
                    device,
                    args.epoch,
                    weight,
                    q))
            processes.append(p)
            p.start()
        for p in processes:
            p.join()
.
.
.
And for the function run_worker:
def run_worker(rank, model, lr, train_loader, device, epoch, weight, q):
    out_weight = rpc.rpc_sync(f"Worker{rank}", train, args=(rank, model, lr, train_loader, device, epoch, weight))
    q.put([rank, out_weight])
What is my main problem?
Error logs
Server initialized!
Server's Epoch:1
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/FYP/FedAvg_RPC.py", line 96, in run_worker
    out_weight = rpc.rpc_sync(f"Worker{rank}", train, args=(rank, model, lr, train_loader, device, epoch, weight))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/rpc/api.py", line 75, in wrapper
    raise RuntimeError(
RuntimeError: RPC has not been initialized. Call torch.distributed.rpc.init_rpc first.
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/FYP/FedAvg_RPC.py", line 96, in run_worker
    out_weight = rpc.rpc_sync(f"Worker{rank}", train, args=(rank, model, lr, train_loader, device, epoch, weight))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/rpc/api.py", line 75, in wrapper
    raise RuntimeError(
RuntimeError: RPC has not been initialized. Call torch.distributed.rpc.init_rpc first.
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/FYP/FedAvg_RPC.py", line 96, in run_worker
    out_weight = rpc.rpc_sync(f"Worker{rank}", train, args=(rank, model, lr, train_loader, device, epoch, weight))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/rpc/api.py", line 75, in wrapper
    raise RuntimeError(
RuntimeError: RPC has not been initialized. Call torch.distributed.rpc.init_rpc first.
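Reading the traceback, each rpc_sync call executes inside Process-1, Process-2 and Process-3, that is, in the children started by mp.Process, while init_rpc sets up RPC only for the process that calls it. One variant that might avoid this is to issue the calls from the server process itself and collect futures, with no extra processes or queue. The sketch below is only an illustration of that pattern: it uses concurrent.futures as a stand-in so it runs without a cluster, and the names run_round, fed_avg and the toy train are hypothetical. In the real script each pool.submit would presumably become an rpc.rpc_async(f"Worker{r}", train, args=...) call and each result() a wait() on the returned future. The unweighted mean is also an assumption, not necessarily the averaging the script uses.

```python
from concurrent.futures import ThreadPoolExecutor


def train(rank, weight):
    # Hypothetical stand-in for the remote train(): returns "updated" weights.
    return {name: value + rank for name, value in weight.items()}


def run_round(world_size, weight):
    # Fan out one call per worker from the current process and gather the
    # results, instead of starting one mp.Process per worker.
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        futures = [pool.submit(train, r, dict(weight)) for r in range(world_size)]
        # Same [rank, out_weight] shape that run_worker puts on the queue.
        return [[r, fut.result()] for r, fut in enumerate(futures)]


def fed_avg(results):
    # Unweighted mean of every parameter across workers (an assumption).
    names = results[0][1].keys()
    return {n: sum(w[n] for _, w in results) / len(results) for n in names}


if __name__ == "__main__":
    results = run_round(3, {"w": 1.0})
    print(fed_avg(results))
```

Since every call is issued from a single process, whatever initialization that process performed applies to all of them, which is the property the mp.Process version appears to lack.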
Minified repro
No response
Versions
Collecting environment information…
PyTorch version: 1.8.0a0+37c1f4a
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 11 (bullseye) (aarch64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.15.84-v8+-aarch64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy==0.812
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] torch==1.8.0a0+37c1f4a
[pip3] torchvision==0.9.0a0+01dfa8e
[conda] Could not collect