I have been trying to run the MoCo code from Facebook Research on a machine with 4 GPUs, but I keep receiving SIGKILLs.
If I run this command (which uses the nccl backend by default),
python main_moco.py -a resnet50 --lr 0.015 --batch-size 128 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 --mlp --moco-t 0.2 --aug-plus --cos
I get a SIGKILL as follows:
File "main_moco.py", line 406, in <module>
main()
File "main_moco.py", line 130, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 2 terminated with signal SIGKILL
With some simple print statements, I have pinpointed the failure to the init_process_group call here.
I also tried running the code with the gloo backend. In that case the init_process_group call succeeds, but the code still dies with a SIGKILL shortly afterwards, in the call to model.cuda().
To narrow down the issue, I tried running the example code from the Setup section of the PyTorch distributed applications tutorial. That code runs perfectly with the gloo backend, but when I switch to the nccl backend, it either hangs in the call to init_process_group or crashes with the following stack trace:
File "multi_proc_test.py", line 17, in init_process
dist.init_process_group(backend, rank=rank, world_size=size)
File "my_path/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "my_path/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
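For reference, this is roughly the test script I distilled from the tutorial's Setup section (the `worker`/`launch` function names and the port number are my own additions; the tutorial hard-codes port 29500):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, size, backend, port):
    """Initialize the process group, run one trivial collective, then tear down."""
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend, rank=rank, world_size=size)
    t = torch.ones(1) * rank
    dist.all_reduce(t)  # sanity-check that the group actually communicates
    dist.destroy_process_group()

def launch(size=2, backend="gloo", port=29501):
    """Spawn `size` worker processes and return their exit codes."""
    procs = []
    for rank in range(size):
        p = mp.Process(target=worker, args=(rank, size, backend, port))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]

if __name__ == "__main__":
    print(launch(backend="gloo"))  # exits cleanly on my machine
    # launch(backend="nccl")       # hangs in init_process_group or crashes
```

The gloo run completes with all exit codes 0; only the nccl run hangs or raises the Broken pipe error above.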
If it helps, I have confirmed that torch.distributed.is_nccl_available() returns True.
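For completeness, here is the quick environment check I used to verify that (all of these are standard PyTorch attributes):

```python
import torch
import torch.distributed as dist

print("torch version:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
print("nccl available:", dist.is_nccl_available())
if dist.is_nccl_available():
    # compiled-in NCCL version tuple, e.g. (2, 4, 8)
    print("nccl version:", torch.cuda.nccl.version())
```

If it would help with diagnosis, I can also rerun with the NCCL_DEBUG=INFO environment variable set to capture NCCL's own logs.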
Any ideas why the code is failing in these places when I use the gloo/nccl backends?
Thank you in advance for your help!