Distributed training hangs

The training hangs without printing any further logs. Observations and configuration:

  • 4 nodes, 4 GPUs per node.
  • Distributed training with one process per GPU (see the sketch after the backtrace below).
  • PyTorch version 1.1; CUDA version 9.0; GPU driver version 410.78.
  • Using the facebook/maskrcnn-benchmark code base, which I thought is just normal PyTorch code.
  • GPU utilization stays close to 100%, but no new logs are produced.
  • Training finished 278K iterations and then hangs with no further progress (no more snapshots, no more logs).
  • gdb attached to one of the processes (sudo gdb -p process_id) shows it apparently hanging in cuMemcpyHtoDAsync_v2:
(gdb) where
#0  0x00007ffe309e1b6d in clock_gettime ()
#1  0x00007f8cc536f876 in __GI___clock_gettime (clock_id=4, tp=0x7ffe30898660) at ../sysdeps/unix/clock_gettime.c:115
#2  0x00007f8c6c7ecc4e in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#3  0x00007f8c6c87b8d3 in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#4  0x00007f8c6c89b81f in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#5  0x00007f8c6c7c8737 in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#6  0x00007f8c6c6d9e4e in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#7  0x00007f8c6c6dbfc3 in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#8  0x00007f8c6c829c82 in cuMemcpyHtoDAsync_v2 () from /usr/local/nvidia/lib64/libcuda.so.1
#9  0x00007f8cbe7ad49c in ?? () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f8cbe78a573 in ?? () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f8cbe7c3d86 in cudaMemcpyAsync () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f8c836a9f4b in (anonymous namespace)::copy_from_cpu(at::Tensor&, at::Tensor const&) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f8c8374a875 in void (anonymous namespace)::_copy__cuda<float>(at::Tensor&, at::Tensor const&, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f8c836aafb8 in at::native::_s_copy__cuda(at::Tensor&, at::Tensor const&, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f8c826d47ef in at::CUDAType::s_copy_(at::Tensor&, at::Tensor const&, bool) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#16 0x00007f8c7764033d in at::native::copy_(at::Tensor&, at::Tensor const&, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f8cbf546dc9 in torch::autograd::VariableType::copy_(at::Tensor&, at::Tensor const&, bool) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#18 0x00007f8c777829cc in at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#19 0x00007f8c77a01857 in at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#20 0x00007f8cbf31cb52 in torch::autograd::VariableType::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const ()
   from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#21 0x00007f8cc0bb8eb3 in torch::autograd::dispatch_to(at::Tensor const&, c10::Device, bool, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#22 0x00007f8cc0bb9598 in torch::autograd::THPVariable_to(_object*, _object*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#23 0x0000556d518096a6 in PyCFunction_Call () at /tmp/build/80754af9/python_1546130271559/work/Objects/methodobject.c:98
#24 0x0000556d518b74ad in do_call_core (kwdict=0x7f8b9b42b168, callargs=0x7f8bb1deec18, func=0x7f8b9b42b510) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:5116
#25 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3404
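For context, the setup described in the bullets above boils down to roughly the following per-process initialization (a minimal sketch under the usual torch.distributed.launch conventions, not the actual maskrcnn-benchmark code; the Linear layer is a hypothetical stand-in for the real model):

# Minimal sketch (assumed, not the exact maskrcnn-benchmark code): one process
# per GPU, NCCL backend, 4 nodes x 4 GPUs = world size 16.
# --local_rank is injected by torch.distributed.launch; MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE are expected to be set in the environment by the launcher.
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)                         # bind this process to a single GPU
dist.init_process_group(backend="nccl", init_method="env://")  # rank/world size read from env vars

model = torch.nn.Linear(10, 10).cuda(args.local_rank)          # stand-in for the real model
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank)

# Training then loops over per-rank batches; every backward() triggers an NCCL
# all-reduce across the 16 processes, so a single stuck rank can stall the
# whole job while the other ranks spin at ~100% GPU utilization.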

Found the issue.