Distributed training hangs

The training hangs without printing any further logs. Observations and configuration:

  • 4 nodes, 4 GPUs per node.
  • Distributed training with one process per GPU (see the sketch after the backtrace below).
  • PyTorch version 1.1; CUDA version 9.0; GPU driver version 410.78.
  • Using the facebook/maskrcnn-benchmark code base, which I thought is just normal PyTorch code.
  • GPU utilization stays close to 100%, but no new logs are produced.
  • Training finished 278K iterations and then hangs with no further progress (no more snapshots, no more logs).
  • gdb attached to one of the processes (sudo gdb -p process_id) shows it apparently hanging in cuMemcpyHtoDAsync_v2:
(gdb) where
#0  0x00007ffe309e1b6d in clock_gettime ()
#1  0x00007f8cc536f876 in __GI___clock_gettime (clock_id=4, tp=0x7ffe30898660) at ../sysdeps/unix/clock_gettime.c:115
#2  0x00007f8c6c7ecc4e in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#3  0x00007f8c6c87b8d3 in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#4  0x00007f8c6c89b81f in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#5  0x00007f8c6c7c8737 in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#6  0x00007f8c6c6d9e4e in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#7  0x00007f8c6c6dbfc3 in ?? () from /usr/local/nvidia/lib64/libcuda.so.1
#8  0x00007f8c6c829c82 in cuMemcpyHtoDAsync_v2 () from /usr/local/nvidia/lib64/libcuda.so.1
#9  0x00007f8cbe7ad49c in ?? () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f8cbe78a573 in ?? () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f8cbe7c3d86 in cudaMemcpyAsync () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f8c836a9f4b in (anonymous namespace)::copy_from_cpu(at::Tensor&, at::Tensor const&) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f8c8374a875 in void (anonymous namespace)::_copy__cuda<float>(at::Tensor&, at::Tensor const&, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f8c836aafb8 in at::native::_s_copy__cuda(at::Tensor&, at::Tensor const&, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f8c826d47ef in at::CUDAType::s_copy_(at::Tensor&, at::Tensor const&, bool) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#16 0x00007f8c7764033d in at::native::copy_(at::Tensor&, at::Tensor const&, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f8cbf546dc9 in torch::autograd::VariableType::copy_(at::Tensor&, at::Tensor const&, bool) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#18 0x00007f8c777829cc in at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#19 0x00007f8c77a01857 in at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#20 0x00007f8cbf31cb52 in torch::autograd::VariableType::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const ()
   from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#21 0x00007f8cc0bb8eb3 in torch::autograd::dispatch_to(at::Tensor const&, c10::Device, bool, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#22 0x00007f8cc0bb9598 in torch::autograd::THPVariable_to(_object*, _object*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#23 0x0000556d518096a6 in PyCFunction_Call () at /tmp/build/80754af9/python_1546130271559/work/Objects/methodobject.c:98
#24 0x0000556d518b74ad in do_call_core (kwdict=0x7f8b9b42b168, callargs=0x7f8bb1deec18, func=0x7f8b9b42b510) at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:5116
#25 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546130271559/work/Python/ceval.c:3404
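For context, the setup described in the bullets above boils down to roughly the following per-process initialization (a minimal sketch under the usual torch.distributed.launch conventions, not the actual maskrcnn-benchmark code; the Linear layer is a hypothetical stand-in for the real model):

# Minimal sketch (assumed, not the exact maskrcnn-benchmark code): one process
# per GPU, NCCL backend, 4 nodes x 4 GPUs = world size 16.
# --local_rank is injected by torch.distributed.launch; MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE are expected to be set in the environment by the launcher.
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)                         # bind this process to a single GPU
dist.init_process_group(backend="nccl", init_method="env://")  # rank/world size read from env vars

model = torch.nn.Linear(10, 10).cuda(args.local_rank)          # stand-in for the real model
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank)

# Training then loops over per-rank batches; every backward() triggers an NCCL
# all-reduce across the 16 processes, so a single stuck rank can stall the
# whole job while the other ranks spin at ~100% GPU utilization.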

Found the issue.