I have been trying to run the MoCo code from Facebook Research on a machine with 4 GPUs, but I keep receiving SIGKILLs.
If I run this command (which uses the nccl backend by default),
python main_moco.py -a resnet50 --lr 0.015 --batch-size 128 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 --mlp --moco-t 0.2 --aug-plus --cos
I get a SIGKILL as follows:
File "main_moco.py", line 406, in <module>
main()
File "main_moco.py", line 130, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "my_path/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 2 terminated with signal SIGKILL
With some simple print statements, I have pinpointed the failure to the init_process_group call here.
I also tried running the code with the gloo backend. In that case the init_process_group call succeeds, but the code still dies with a SIGKILL shortly afterwards, in the call to model.cuda().
To narrow down the issue, I tried running the example code from the Setup section of the PyTorch distributed applications tutorial. That code runs perfectly with the gloo backend, but when I switch to the nccl backend, it either hangs in the call to init_process_group or crashes with the following stack trace:
File "multi_proc_test.py", line 17, in init_process
dist.init_process_group(backend, rank=rank, world_size=size)
File "my_path/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "my_path/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
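For reference, this is roughly the test script I distilled from the tutorial's Setup section (the `worker`/`launch` function names and the port number are my own additions; the tutorial hard-codes port 29500):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, size, backend, port):
    """Initialize the process group, run one trivial collective, then tear down."""
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend, rank=rank, world_size=size)
    t = torch.ones(1) * rank
    dist.all_reduce(t)  # sanity-check that the group actually communicates
    dist.destroy_process_group()

def launch(size=2, backend="gloo", port=29501):
    """Spawn `size` worker processes and return their exit codes."""
    procs = []
    for rank in range(size):
        p = mp.Process(target=worker, args=(rank, size, backend, port))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]

if __name__ == "__main__":
    print(launch(backend="gloo"))  # exits cleanly on my machine
    # launch(backend="nccl")       # hangs in init_process_group or crashes
```

The gloo run completes with all exit codes 0; only the nccl run hangs or raises the Broken pipe error above.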
If it helps, I have confirmed that torch.distributed.is_nccl_available() returns True.
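For completeness, here is the quick environment check I used to verify that (all of these are standard PyTorch attributes):

```python
import torch
import torch.distributed as dist

print("torch version:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
print("nccl available:", dist.is_nccl_available())
if dist.is_nccl_available():
    # compiled-in NCCL version tuple, e.g. (2, 4, 8)
    print("nccl version:", torch.cuda.nccl.version())
```

If it would help with diagnosis, I can also rerun with the NCCL_DEBUG=INFO environment variable set to capture NCCL's own logs.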
Any ideas why the code is failing in these places when I use the gloo/nccl backends?
Thank you in advance for your help!