init_process_group() sometimes (not always) hangs with PyTorch 1.0

With Python 3.6.7 + PyTorch 1.0.0, init_process_group() sometimes hangs and never returns. Any idea how to fix it? I need to run some projects under version 1.0. Here are the details.

Code scripts

a.py

import os
from pprint import pformat

import torch
import torch.distributed as dist

def get_mpi_rank():
    # RANK is set for each process by torch.distributed.launch
    return int(os.environ['RANK'])

def get_mpi_size():
    # WORLD_SIZE is also set by the launcher; default to a single process
    return int(os.environ.get('WORLD_SIZE', '1'))

rank = get_mpi_rank()
world_size = get_mpi_size()

init_param = {'backend': 'nccl',
              'init_method': 'env://',
              'rank': rank,
              'world_size': world_size}

print('before {} - {}\n'.format(rank, pformat(init_param)))
# With pytorch 1.0.0 this call sometimes never returns on rank 1
dist.init_process_group(**init_param)
print('after {}'.format(rank))

When it works

python 2.7.12 + pytorch 0.4.1

$ python --version
Python 2.7.12
$ python -c 'import torch; print torch.__version__'
0.4.1
$ python -m torch.distributed.launch --nproc_per_node 2 a.py
before 1 - {'backend': 'nccl', 'init_method': 'env://', 'rank': 1, 'world_size': 2}

before 0 - {'backend': 'nccl', 'init_method': 'env://', 'rank': 0, 'world_size': 2}

after 0
after 1

If I run the script multiple times, it always succeeds.

When it does not work

python 3.6.7 + pytorch 1.0.0

$ python --version
Python 3.6.7 :: Anaconda, Inc.
$ python -c 'import torch; print(torch.__version__)'
1.0.0
$ python -m torch.distributed.launch --nproc_per_node 2 a.py
before 1 - {'backend': 'nccl', 'init_method': 'env://', 'rank': 1, 'world_size': 2}

before 0 - {'backend': 'nccl', 'init_method': 'env://', 'rank': 0, 'world_size': 2}

after 0

Rank 0 finishes the call to init_process_group, but rank 1 never returns. I then attached gdb to the hung process.

$ sudo gdb -p 40855
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 40855
Reading symbols from /raid/jianfw/anaconda3/bin/python3.6...done.
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...Reading symbols from /usr/lib/debug/.build-id/ce/17e023542265fc11d9bc8f534bb4f070493d30.debug...done.
done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.23.so...done.
done.
(gdb) where
#0  0x00007f0ce5586c00 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f0cde561e88 in c10d::tcputil::connect(std::string const&, unsigned short, bool, std::chrono::duration<long, std::ratio<1l, 1000l> > const&) ()
   from /raid/jianfw/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#2  0x00007f0cde55cef5 in c10d::TCPStore::TCPStore(std::string const&, unsigned short, bool) () from /raid/jianfw/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#3  0x00007f0cde4f09f6 in void pybind11::cpp_function::initialize<void pybind11::detail::initimpl::constructor<std::string const&, int, bool>::execute<pybind11::class_<c10d::TCPStore, std::shared_ptr<c10d::TCPStore> >, , 0>(pybind11::class_<c10d::TCPStore, std::shared_ptr<c10d::TCPStore> >&)::{lambda(pybind11::detail::value_and_holder&, std::string const&, int, bool)#1}, void, pybind11::detail::value_and_holder&, std::string const&, int, bool, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor>(void pybind11::detail::initimpl::constructor<std::string const&, int, bool>::execute<pybind11::class_<c10d::TCPStore, std::shared_ptr<c10d::TCPStore> >, , 0>(pybind11::class_<c10d::TCPStore, std::shared_ptr<c10d::TCPStore> >&)::{lambda(pybind11::detail::value_and_holder&, std::string const&, int, bool)#1}&&, void (*)(pybind11::detail::value_and_holder&, std::string const&, int, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /raid/jianfw/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#4  0x00007f0cddff0e36 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /raid/jianfw/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x000055b391cbe3d4 in _PyCFunction_FastCallDict () at /tmp/build/80754af9/python_1540319457073/work/Objects/methodobject.c:231
#6  0x000055b391cbe7ef in _PyObject_FastCallDict () at /tmp/build/80754af9/python_1540319457073/work/Objects/abstract.c:2313
#7  0x000055b391cc3303 in _PyObject_Call_Prepend () at /tmp/build/80754af9/python_1540319457073/work/Objects/abstract.c:2373
#8  0x000055b391cbe1de in PyObject_Call () at /tmp/build/80754af9/python_1540319457073/work/Objects/abstract.c:2261
#9  0x000055b391d1b78b in slot_tp_init () at /tmp/build/80754af9/python_1540319457073/work/Objects/typeobject.c:6420
#10 0x000055b391d47f57 in type_call () at /tmp/build/80754af9/python_1540319457073/work/Objects/typeobject.c:915
#11 0x000055b391cbe5bb in _PyObject_FastCallDict () at /tmp/build/80754af9/python_1540319457073/work/Objects/abstract.c:2331
#12 0x000055b391d47d6e in call_function () at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:4861
#13 0x000055b391d6a71a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:3335
#14 0x000055b391d4a860 in gen_send_ex (closing=0, exc=0, arg=0x0, gen=0x7f0ca53fed58) at /tmp/build/80754af9/python_1540319457073/work/Objects/genobject.c:189
#15 gen_iternext (gen=0x7f0ca53fed58) at /tmp/build/80754af9/python_1540319457073/work/Objects/genobject.c:563
#16 builtin_next () at /tmp/build/80754af9/python_1540319457073/work/Python/bltinmodule.c:1330
#17 0x000055b391cbe311 in _PyCFunction_FastCallDict () at /tmp/build/80754af9/python_1540319457073/work/Objects/methodobject.c:234
#18 0x000055b391d47c1c in call_function () at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:4837
#19 0x000055b391d6a71a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:3335
#20 0x000055b391d42ad9 in _PyEval_EvalCodeWithName (qualname=0x0, name=0x0, closure=0x0, kwdefs=0x0, defcount=2, defs=0x7f0ca578dd60, kwstep=2, kwcount=<optimized out>,
    kwargs=0x7f0ce447eae8, kwnames=0x7f0ce447eae0, argcount=<optimized out>, args=0x55b393457df0, locals=0x0, globals=<optimized out>, _co=0x7f0ca54e8270)
    at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:4166
#21 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:4187
#22 0x000055b391d43a06 in function_call () at /tmp/build/80754af9/python_1540319457073/work/Objects/funcobject.c:604
#23 0x000055b391cbe1de in PyObject_Call () at /tmp/build/80754af9/python_1540319457073/work/Objects/abstract.c:2261
#24 0x000055b391d6bd9a in do_call_core (kwdict=0x7f0ce58c0678, callargs=0x7f0ce594b048, func=0x7f0ca54f0a60) at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:5106
#25 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:3404
#26 0x000055b391d42ad9 in _PyEval_EvalCodeWithName (qualname=0x0, name=0x0, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=<optimized out>, kwargs=0x0, kwnames=0x0,
    argcount=0, args=0x0, locals=0x7f0ce5901360, globals=0x7f0ce5901360, _co=0x7f0ce448e5d0) at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:4166
#27 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:4187
#28 0x000055b391d4387c in PyEval_EvalCode (co=co@entry=0x7f0ce448e5d0, globals=globals@entry=0x7f0ce5901360, locals=locals@entry=0x7f0ce5901360)
    at /tmp/build/80754af9/python_1540319457073/work/Python/ceval.c:731
#29 0x000055b391dc4074 in run_mod () at /tmp/build/80754af9/python_1540319457073/work/Python/pythonrun.c:1025
#30 0x000055b391dc4471 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1540319457073/work/Python/pythonrun.c:978
#31 0x000055b391dc4673 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1540319457073/work/Python/pythonrun.c:419
#32 0x000055b391dc477d in PyRun_AnyFileExFlags () at /tmp/build/80754af9/python_1540319457073/work/Python/pythonrun.c:81
#33 0x000055b391dc81b0 in run_file (p_cf=0x7ffc8471a94c, filename=0x55b3933a6300 L"a.py", fp=0x55b393433e20) at /tmp/build/80754af9/python_1540319457073/work/Modules/main.c:340
#34 Py_Main () at /tmp/build/80754af9/python_1540319457073/work/Modules/main.c:811
#35 0x000055b391c8fb4e in main () at /tmp/build/80754af9/python_1540319457073/work/Programs/python.c:69
#36 0x00007f0ce51cc830 in __libc_start_main (main=0x55b391c8fa60 <main>, argc=4, argv=0x7ffc8471ab58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7ffc8471ab48) at ../csu/libc-start.c:291
#37 0x000055b391d711a8 in _start () at ../sysdeps/x86_64/elf/start.S:103
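
The gdb backtrace shows the native frames but not which Python line is waiting. A lighter-weight complement, on Python 3 only, is the standard-library faulthandler module, which can dump the Python stacks of all threads after a delay even while the main thread is blocked in native code. The following is a minimal sketch; the helper names arm_hang_watchdog and disarm_hang_watchdog are hypothetical and not part of PyTorch.

```python
import faulthandler
import sys


def arm_hang_watchdog(seconds, file=sys.stderr):
    """Dump every thread's Python stack to `file` unless cancelled within `seconds`."""
    # exit=False: only report the hang, keep the process alive so gdb can still attach
    faulthandler.dump_traceback_later(seconds, exit=False, file=file)


def disarm_hang_watchdog():
    """Cancel a pending dump once the guarded call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

In a.py this would presumably be used by calling arm_hang_watchdog(60) just before dist.init_process_group(**init_param) and disarm_hang_watchdog() right after it, so that only a rank that is still stuck after 60 seconds prints its stack.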

It looks like it hangs in the constructor c10d::TCPStore::TCPStore(std::string const&, unsigned short, bool), inside its call to c10d::tcputil::connect(std::string const&, unsigned short, bool, std::chrono::duration<long, std::ratio<1l, 1000l> > const&), which is sleeping in __nanosleep_nocancel (frame #0).
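
Since the stuck frame is a TCP connect, one thing that may be worth checking while the process is hung is whether the rendezvous address accepts a plain TCP connection at all. Below is a minimal sketch using only the standard library. It assumes the address comes from the MASTER_ADDR and MASTER_PORT environment variables that torch.distributed.launch exports for env:// initialization; can_connect is a hypothetical helper, and the fallback values are only placeholders for running the probe by hand.

```python
import os
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a plain TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts
        return False


if __name__ == '__main__':
    # Fallbacks are placeholders; under the launcher both variables are already set.
    host = os.environ.get('MASTER_ADDR', '127.0.0.1')
    port = int(os.environ.get('MASTER_PORT', '29500'))
    print('store reachable at {}:{} -> {}'.format(host, port, can_connect(host, port)))
```

A True result only shows that something is listening on that port; it does not prove that the listener is the store created by rank 0.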

Any idea how to fix it?

Thanks

has the fix