I have been trying to run the MoCo code from Facebook Research on a machine with 4 GPUs, but the training processes are consistently being killed with SIGKILL.
If I run this command (which uses the nccl backend by default):
python main_moco.py -a resnet50 --lr 0.015 --batch-size 128 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 --mlp --moco-t 0.2 --aug-plus --cos
I get a SIGKILL as follows:
  File "main_moco.py", line 406, in <module>
    main()
  File "main_moco.py", line 130, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGKILL
With some simple print statements, I have pinpointed this issue to the init_process_group call here.
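(For reference, here is the kind of isolated check I mean: a single-process gloo group, with an arbitrary port. An isolated call like this is the sort of thing that succeeds for gloo while the nccl version does not.)

```python
# Single-process sanity check of init_process_group in isolation.
# The address/port and the gloo backend here are just for illustration.
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"  # arbitrary free port

dist.init_process_group("gloo", rank=0, world_size=1)
initialized = dist.is_initialized()
dist.destroy_process_group()
print("initialized:", initialized)
```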
I also tried running the code with the gloo backend. In that case the init_process_group call succeeds, but the code still fails with a SIGKILL a little later, in the call to model.cuda().
To narrow the issue down further, I tried running the example code in the Setup section of the PyTorch distributed applications tutorial. That code runs perfectly with the gloo backend, but when I switch to the nccl backend, it either hangs in the call to init_process_group or crashes with the following stack trace:
  File "multi_proc_test.py", line 17, in init_process
    dist.init_process_group(backend, rank=rank, world_size=size)
  File "my_path/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "my_path/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: Broken pipe
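For completeness, what I ran is essentially a minimal version of that tutorial's setup code (lightly trimmed; the worker here is just a toy all_reduce, and the port number is arbitrary):

```python
# Minimal version of the tutorial-style setup: world_size=2 on one
# machine, gloo backend, with a trivial all_reduce as the worker.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    # Trivial collective so the process group is actually exercised.
    tensor = torch.ones(1)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

def init_process(rank, size, fn, backend="gloo"):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # arbitrary free port
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

size = 2
processes = []
for rank in range(size):
    p = mp.Process(target=init_process, args=(rank, size, run))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
print([p.exitcode for p in processes])
```

Swapping backend="gloo" for backend="nccl" in init_process is all it takes to reproduce the hang/crash above.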
If it helps, I have confirmed that torch.distributed.is_nccl_available() returns true.
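That check was part of a small environment summary along these lines (the dict and its keys are just for this post):

```python
# Collect basic version/device info relevant to the nccl backend.
import torch
import torch.distributed as dist

env_info = {
    "torch_version": torch.__version__,
    "cuda_runtime": torch.version.cuda,      # None on CPU-only builds
    "gpu_count": torch.cuda.device_count(),
    "nccl_available": dist.is_nccl_available(),
}
print(env_info)
```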
Any ideas why the code is failing in these places when I use the gloo/nccl backends?
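Happy to gather more logs if that would help — for example, I can re-run with NCCL's own debug logging enabled:

```shell
# NCCL_DEBUG is a standard NCCL environment variable; INFO makes NCCL
# log its bootstrap/transport setup, which may show where a rank dies.
export NCCL_DEBUG=INFO
# then re-run the same command as above, e.g.:
# python main_moco.py -a resnet50 --lr 0.015 --batch-size 128 ... (same flags as above)
```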
Thank you in advance for your help!