Hello everyone,
I am trying to use the PyTorch distributed package with the gloo backend, but I get the following error.
On the master:
Traceback (most recent call last):
  File "Distributed.py", line 169, in <module>
    init_processes(args.rank, size, run)
  File "Distributed.py", line 80, in init_processes
    dist.init_process_group(backend=backend, rank=rank, world_size=size)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/__init__.py", line 49, in init_process_group
    group_name, rank)
RuntimeError: [enforce fail at /pytorch/torch/lib/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: <My specified Ip> at /pytorch/torch/lib/THD/process_group/General.cpp:17
And another worker gets:
RuntimeError: [enforce fail at /pytorch/torch/lib/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: 10.37.0.1 at /pytorch/torch/lib/THD/process_group/General.cpp:17
Below is how I initialize:
def init_processes(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = <My specified Ip>
    os.environ['MASTER_PORT'] = '8888' if rank == 0 else '31566'
    print("Init Processes ->", 'backend:', backend, 'rank:', rank, 'MASTER_ADDR:', os.environ['MASTER_ADDR'],
          'MASTER_PORT:', os.environ['MASTER_PORT'])
    dist.init_process_group(backend=backend, rank=rank, world_size=size)
    fn(rank, size)
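One check that might help narrow this down is sketched here. It is not part of Distributed.py, the helper name can_resolve_and_bind is made up for illustration, and it assumes the gloo failure is related to the address not resolving or not being bindable from inside the container, which I have not confirmed. It only uses the standard socket module:

```python
import socket


def can_resolve_and_bind(host, port=0):
    """Return True if `host` resolves and a TCP socket can bind to it locally.

    Hypothetical diagnostic helper; port=0 lets the OS pick any free port.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name or IP string could not be resolved at all.
        return False
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.bind(sockaddr)
                return True
        except OSError:
            # This candidate address is not assigned to a local interface
            # (or cannot be bound for another reason); try the next one.
            continue
    return False


if __name__ == "__main__":
    # The loopback address should normally be bindable on any machine.
    print("127.0.0.1 bindable:", can_resolve_and_bind("127.0.0.1"))
```

Running this in each container with the address from the error message (for example 10.37.0.1 on the worker) would show whether that address is actually assigned to an interface the container can see.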
Any thoughts on what may be causing this or how I can fix it? Thanks.
P.S. I am running in a Docker environment with Python 3.5 and PyTorch 0.3.1 with CUDA 9.0.