PyTorch "NCCL error: unhandled system error" during backprop

I am trying to do distributed training with PyTorch and have run into a problem.
This runtime error occurs during the first backward pass (initially the error
occurred during model initialization).

  File "/home/user/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/lib/python3.7/site-packages/mpi4py/__main__.py", line 7, in <module>
    main()
  File "/home/user/anaconda3/lib/python3.7/site-packages/mpi4py/run.py", line 196, in main
    run_command_line(args)
  File "/home/user/anaconda3/lib/python3.7/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "/home/user/anaconda3/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/user/anaconda3/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/user/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "project/main.py", line 115, in <module>
    trainer.run(config["epochs"])
  File "/home/user/project/trainer/trainer.py", line 107, in run
    self.run_epoch()
  File "/home/user/project/trainer/trainer.py", line 70, in run_epoch
    loss.backward()
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled system error

The error occurs every time.
I use MPI for automatic rank assignment and NCCL as the main backend.
Initialization is done through a file on a shared file system.
Each process uses 2 GPUs, and the processes run on different nodes.
The environment variable NCCL_SOCKET_IFNAME is set.
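
For context, the process group is initialized roughly like this (the shared-file path and the exact MPI calls here are illustrative, not my actual code):

import torch.distributed as dist
from mpi4py import MPI

comm = MPI.COMM_WORLD

# NCCL backend, ranks taken from MPI, rendezvous through a file
# on the shared file system (placeholder path).
dist.init_process_group(
    backend="nccl",
    init_method="file:///shared/fs/ddp_init_file",
    rank=comm.Get_rank(),
    world_size=comm.Get_size(),
)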

Does anyone know why this error may occur? Thanks in advance.

NCCL errors can be notoriously cryptic. Can you also reproduce the issue when you run 2 processes per machine and 4 in total (so that each process uses just a single GPU)?

No, with one process per GPU the NCCL error doesn't reproduce. But another problem arises: all processes freeze during DistributedDataParallel initialization.

from torch.nn.parallel import DistributedDataParallel

# Wrap the model for single-GPU-per-process training.
model = DistributedDataParallel(
    model,
    device_ids=[device],
    output_device=device,
)
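
Here `device` comes from the local rank of the process, roughly like this (the environment variable name is the Open MPI one; other launchers expose a different variable):

import os
import torch

# One process per GPU: pick the device from the launcher-assigned
# local rank (OMPI_COMM_WORLD_LOCAL_RANK is Open MPI specific).
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)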

You can set the environment variable NCCL_DEBUG=INFO to make NCCL output diagnostic logs.
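
For example, set it before the process group is created, either as an export in the launch command or at the top of the script:

import os

# Must be set before init_process_group so NCCL picks it up
# when the communicator is created.
os.environ["NCCL_DEBUG"] = "INFO"
# Optionally limit logging to the init and network subsystems.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"

The per-rank logs usually show which transport and network interface NCCL selects, which helps narrow down where it fails.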

Also see: