barrier() makes the process hang when using SLURM

Hi fellow community members,

I have encountered an issue when training PyTorch models with SLURM and multiple GPUs on a single node:
On my local PC with SLURM, once I call barrier(), the non-zero-rank process hangs at the first barrier(), but if I configure the job to use a single process, the program runs fine.
Interestingly, on a real HPC cluster with SLURM, I cannot even get past the first barrier(): the rank-0 process (and the non-zero-rank process as well) hangs at barrier(). This happens even when I configure the job to use a single process.
The same happens in CPU mode.

Could anyone help with this? Thanks a lot!

The following is the minimal code to reproduce the issue:
test.py

import os
import hostlist
import torch
import torch.distributed as dist

use_gpu = False
gpu_id = 0
device = None

distributed = False
dist_rank = 0
world_size = 1


def set_gpu_mode(mode):
    global use_gpu
    global device
    global gpu_id
    global distributed
    global dist_rank
    global world_size
    # Read local GPU id, global rank, and world size from the SLURM environment.
    gpu_id = int(os.environ.get("SLURM_LOCALID", 0))
    dist_rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NTASKS", 1))

    distributed = world_size > 1
    use_gpu = mode
    device = torch.device(f"cuda:{gpu_id}" if use_gpu else "cpu")
    torch.backends.cudnn.benchmark = True


def init_process(backend="nccl"):
    print(device, dist_rank, world_size)
    print(f"Starting process with rank {dist_rank}...", flush=True)

    if "SLURM_STEPS_GPUS" in os.environ:
        gpu_ids = os.environ["SLURM_STEP_GPUS"].split(",")
        os.environ["MASTER_PORT"] = str(12345 + int(min(gpu_ids)))
    else:
        os.environ["MASTER_PORT"] = str(12345)

    if "SLURM_JOB_NODELIST" in os.environ:
        hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
        print(hostnames)
        os.environ["MASTER_ADDR"] = hostnames[0]
    else:
        os.environ["MASTER_ADDR"] = "127.0.0.1"

    # Default env:// rendezvous: uses MASTER_ADDR and MASTER_PORT set above.
    dist.init_process_group(
        backend,
        rank=dist_rank,
        world_size=world_size,
    )
    print(f"Process {dist_rank} is connected.", flush=True)
    dist.barrier()
    print('check', flush=True)
    if dist_rank == 0:
        print(f"All processes are connected.", flush=True)


set_gpu_mode(False)
init_process()

SLURM batch script:

#!/bin/bash
#SBATCH --job-name=slurm_test

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16


#SBATCH --time=48:00:00

# activate the conda environment
. /home/cylu/anaconda3/etc/profile.d/conda.sh
conda activate segmenter
export NCCL_DEBUG=INFO

srun python test.py
echo 'Done'

output:

cuda:1 1 2
Starting process with rank 1...
cuda:0 0 2
Starting process with rank 0...
['cylu-pc']
Process 0 is connected.
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
['cylu-pc']
Process 1 is connected.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
cylu-pc:80965:80965 [0] NCCL INFO Bootstrap : Using [0]enp5s0:131.155.125.194<0>
cylu-pc:80965:80965 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cylu-pc:80965:80965 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cylu-pc:80965:80965 [0] NCCL INFO NET/Socket : Using [0]enp5s0:131.155.125.194<0>
cylu-pc:80965:80965 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
cylu-pc:80966:80966 [1] NCCL INFO Bootstrap : Using [0]enp5s0:131.155.125.194<0>
cylu-pc:80966:80966 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cylu-pc:80966:80966 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cylu-pc:80966:80966 [1] NCCL INFO NET/Socket : Using [0]enp5s0:131.155.125.194<0>
cylu-pc:80966:80966 [1] NCCL INFO Using network Socket
cylu-pc:80965:81006 [0] NCCL INFO Channel 00/02 :    0   1
cylu-pc:80965:81006 [0] NCCL INFO Channel 01/02 :    0   1
cylu-pc:80966:81007 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
cylu-pc:80966:81007 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
cylu-pc:80966:81007 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff00
cylu-pc:80965:81006 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
cylu-pc:80965:81006 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
cylu-pc:80965:81006 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
cylu-pc:80965:81006 [0] NCCL INFO Channel 00 : 0[9000] -> 1[42000] via P2P/IPC
cylu-pc:80966:81007 [1] NCCL INFO Channel 00 : 1[42000] -> 0[9000] via P2P/IPC
cylu-pc:80965:81006 [0] NCCL INFO Channel 01 : 0[9000] -> 1[42000] via P2P/IPC
cylu-pc:80966:81007 [1] NCCL INFO Channel 01 : 1[42000] -> 0[9000] via P2P/IPC
cylu-pc:80965:81006 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
cylu-pc:80965:81006 [0] NCCL INFO comm 0x7f6b30001060 rank 0 nranks 2 cudaDev 0 busId 9000 - Init COMPLETE
cylu-pc:80965:80965 [0] NCCL INFO Launch mode Parallel
cylu-pc:80966:81007 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
cylu-pc:80966:81007 [1] NCCL INFO comm 0x7fed78001060 rank 1 nranks 2 cudaDev 1 busId 42000 - Init COMPLETE
check
All processes are connected.

As you can see, only one ‘check’ is printed, from the rank-0 process; the other process never prints ‘check’.

@Chenyang-Lu Looking into the script you shared, it seems srun won’t help allocate multiple processes automatically; you need to either use the distributed launch facilities (i.e. Distributed communication package - torch.distributed — PyTorch 1.12 documentation) or manually launch multiple processes for your script and set the CUDA device for each process accordingly when using the NCCL process group.
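
For the second option, here is a minimal sketch (an untested illustration, assuming the same one-task-per-GPU srun layout and SLURM variables as in test.py) of how each rank could be pinned to its own device, following the device_ids hint in the ProcessGroupNCCL warning in the log above:

import os

import torch
import torch.distributed as dist


def init_process(backend="nccl"):
    # Same SLURM variables as in test.py above.
    gpu_id = int(os.environ.get("SLURM_LOCALID", 0))
    dist_rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NTASKS", 1))

    # Rendezvous settings; in the real script these come from the SLURM node list.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "12345")

    # Pin this rank to its GPU *before* creating the process group, so NCCL
    # does not have to guess the rank-to-device mapping.
    torch.cuda.set_device(gpu_id)

    dist.init_process_group(backend, rank=dist_rank, world_size=world_size)
    print(f"Process {dist_rank} is connected.", flush=True)

    # Tell the NCCL backend explicitly which device to use for the barrier,
    # as the warning in the log suggests.
    dist.barrier(device_ids=[gpu_id])

    if dist_rank == 0:
        print("All processes are connected.", flush=True)


if __name__ == "__main__":
    init_process()

For a CPU-only test, the backend would have to be "gloo" and the device_ids argument dropped, since device_ids is only supported by the NCCL backend.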