mp.spawn on Slurm with multiple GPUs

I have a problem running torch.multiprocessing.spawn (mp.spawn) on Slurm with multiple GPUs.

Instructions To Reproduce the Issue:

  1. Full runnable code:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def test_nccl_ops():
    num_gpu = 2
    print("NCCL init before spawn")
    # Rendezvous via a file on the local filesystem, shared by the two workers.
    dist_url = "file:///tmp/nccl_tmp_file"
    mp.spawn(_test_nccl_worker, nprocs=num_gpu, args=(num_gpu, dist_url), daemon=False)
    print("NCCL init succeeded.")


def _test_nccl_worker(rank, num_gpu, dist_url):
    # Each spawned process joins the NCCL process group and waits at the barrier.
    dist.init_process_group(backend="NCCL", init_method=dist_url, rank=rank, world_size=num_gpu)
    dist.barrier()
    print("Worker after barrier")


if __name__ == "__main__":
    test_nccl_ops()
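
For reference, the ProcessGroupNCCL warning in the log below suggests binding each rank to its GPU and passing device_ids to the barrier; a minimal sketch of such a worker (not what the code above runs, and it assumes one GPU per rank on a single node) would be:

import torch
import torch.distributed as dist

def _test_nccl_worker_pinned(rank, num_gpu, dist_url):
    # Hypothetical variant of _test_nccl_worker: pin this process to GPU `rank`
    # before the NCCL process group is created.
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="NCCL", init_method=dist_url, rank=rank, world_size=num_gpu)
    # device_ids tells NCCL which device this rank uses for the barrier.
    dist.barrier(device_ids=[rank])
    print(f"Worker {rank} after barrier")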

We use the following Slurm script to run the experiment on 2 GPUs:

#!/bin/bash -l

#SBATCH --account=Account
#SBATCH --partition=gpu # gpu partition
#SBATCH --nodes=1 # 1 node, 4 GPUs per node
#SBATCH --time=24:00:00 
#SBATCH --job-name=detectron2_demo4 # job name



module load Python/3.9.5-GCCcore-10.3.0
module load CUDA/11.1.1-GCC-10.2.0

cd /experiment_path

export NCCL_DEBUG=INFO

srun python main.py --num-gpus 2
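
Note: the script above has no explicit GPU request; on many clusters the header would also need something along these lines (exact directives depend on the cluster configuration, so this is only an assumption):

#SBATCH --gres=gpu:2          # request 2 GPUs on the node (cluster-specific syntax)
#SBATCH --ntasks-per-node=1   # a single task; mp.spawn starts the per-GPU workers itself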

When I ran this script, the job produced only the following output (from cat slurm-xxx.out) and no error file:

The following have been reloaded with a version change:
  1) GCCcore/10.3.0 => GCCcore/10.2.0
  2) binutils/2.36.1-GCCcore-10.3.0 => binutils/2.35-GCCcore-10.2.0
  3) zlib/1.2.11-GCCcore-10.3.0 => zlib/1.2.11-GCCcore-10.2.0

NCCL init before spawn
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
gpu04:9770:9770 [0] NCCL INFO Bootstrap : Using [0]bond0:10.10.1.4<0>
gpu04:9770:9770 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gpu04:9770:9770 [0] NCCL INFO NET/IB : No device found.
gpu04:9770:9770 [0] NCCL INFO NET/Socket : Using [0]bond0:10.10.1.4<0>
gpu04:9770:9770 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
gpu04:9771:9771 [1] NCCL INFO Bootstrap : Using [0]bond0:10.10.1.4<0>
gpu04:9771:9771 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gpu04:9771:9771 [1] NCCL INFO NET/IB : No device found.
gpu04:9771:9771 [1] NCCL INFO NET/Socket : Using [0]bond0:10.10.1.4<0>
gpu04:9771:9771 [1] NCCL INFO Using network Socket
gpu04:9771:9862 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
gpu04:9771:9862 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
gpu04:9771:9862 [1] NCCL INFO Setting affinity for GPU 1 to 3fff
gpu04:9770:9861 [0] NCCL INFO Channel 00/02 :    0   1
gpu04:9770:9861 [0] NCCL INFO Channel 01/02 :    0   1
gpu04:9770:9861 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
gpu04:9770:9861 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
gpu04:9770:9861 [0] NCCL INFO Setting affinity for GPU 0 to 3fff
gpu04:9771:9862 [1] NCCL INFO Channel 00 : 1[6000] -> 0[5000] via P2P/IPC
gpu04:9770:9861 [0] NCCL INFO Channel 00 : 0[5000] -> 1[6000] via P2P/IPC
gpu04:9771:9862 [1] NCCL INFO Channel 01 : 1[6000] -> 0[5000] via P2P/IPC
gpu04:9770:9861 [0] NCCL INFO Channel 01 : 0[5000] -> 1[6000] via P2P/IPC
gpu04:9771:9862 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu04:9771:9862 [1] NCCL INFO comm 0x7f057c000e00 rank 1 nranks 2 cudaDev 1 busId 6000 - Init COMPLETE
gpu04:9770:9861 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu04:9770:9861 [0] NCCL INFO comm 0x7f5210000e00 rank 0 nranks 2 cudaDev 0 busId 5000 - Init COMPLETE
gpu04:9770:9770 [0] NCCL INFO Launch mode Parallel


Expected behavior:

Training should run on 2 GPUs and print further output after “NCCL init before spawn” and the NCCL debug info.

Environment:

Output of the environment collection script:

No CUDA runtime is found, using CUDA_HOME='/usr/local/software/CUDAcore/11.1.1'
---------------------  --------------------------------------------------------------------------------
sys.platform           linux
Python                 3.9.5 (default, Jul  9 2021, 09:35:24) [GCC 10.3.0]
numpy                  1.21.1
detectron2             0.5 @/home/users/aimhigh/detectron2/detectron2
Compiler               GCC 10.2
CUDA compiler          CUDA 11.1
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.9.0+cu102 @/home/users/aimhigh/.local/lib/python3.9/site-packages/torch
PyTorch debug build    False
GPU available          No: torch.cuda.is_available() == False
Pillow                 8.3.1
torchvision            0.10.0+cu102 @/home/users/aimhigh/.local/lib/python3.9/site-packages/torchvision
fvcore                 0.1.5.post20210727
iopath                 0.1.9
cv2                    4.5.3
---------------------  --------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
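
Since the environment script reports “No CUDA runtime is found”, a quick sanity check run inside the srun allocation would show what PyTorch itself detects (just an illustrative snippet):

import torch

# What this PyTorch build and the current node expose.
print("torch:", torch.__version__)            # 1.9.0+cu102 bundles the CUDA 10.2 runtime
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())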

Additional note: at first I assumed this was a detectron2 problem, but it is not. You can find my previous discussion with the detectron2 developers here: link.