Hi @mrshenli,
Here are the errors I am getting.
Error on the master/primary server:
Traceback (most recent call last):
File "ddp.py", line 48, in <module>
main()
File "ddp.py", line 42, in main
mp.spawn(example,
File "/apps/miniconda3/envs/torch/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/apps/miniconda3/envs/torch/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/apps/miniconda3/envs/torch/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/apps/dl/torch/ddp.py", line 22, in example
ddp_model = DDP(model)
File "/apps/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 301, in __init__
self._distributed_broadcast_coalesced(
File "/apps/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 485, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: [/opt/conda/conda-bld/pytorch_1579022027171/work/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [SECONDARY_IP]:43924
Error on the secondary server:
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1579022027171/work/third_party/gloo/gloo/transport/tcp/device.cc:281] rv != -1. -1 vs -1. epoll_ctl: No such file or directory
Traceback (most recent call last):
File "ddp.py", line 48, in <module>
main()
File "ddp.py", line 42, in main
mp.spawn(example,
File "/apps/miniconda3/envs/torch/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/apps/miniconda3/envs/torch/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
raise Exception(
Exception: process 1 terminated with signal SIGABRT
Here is the relevant setup code on both servers.
Master:
def example(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
Secondary:
def example(rank, world_size):
    os.environ['MASTER_ADDR'] = 'master_ip'
    os.environ['MASTER_PORT'] = '12355'
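
For comparison, here is a minimal sketch of rendezvous settings that are identical on both servers except for the node index. It is not a confirmed fix for the errors above. `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE` are the variables read by the `env://` init method; the helper name `rendezvous_env`, its parameters, and the address `10.0.0.1` are hypothetical placeholders, and the sketch assumes one process per server.

```python
def rendezvous_env(master_addr, master_port, node_rank, local_rank,
                   procs_per_node, num_nodes):
    """Hypothetical helper: build the env:// variables for one process."""
    return {
        # Same routable address on every server (placeholder value below).
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Global rank must be unique across all servers.
        "RANK": str(node_rank * procs_per_node + local_rank),
        # Total number of processes across all servers.
        "WORLD_SIZE": str(num_nodes * procs_per_node),
    }


# Assumed layout: 2 servers, 1 process each; 10.0.0.1 stands in for the
# primary server's real address.
primary = rendezvous_env("10.0.0.1", 12355, node_rank=0, local_rank=0,
                         procs_per_node=1, num_nodes=2)
secondary = rendezvous_env("10.0.0.1", 12355, node_rank=1, local_rank=0,
                           procs_per_node=1, num_nodes=2)
```

Each process would then apply its own dictionary with `os.environ.update(...)` before calling `dist.init_process_group`. With this layout only `RANK` differs between the two servers.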