Hi everyone,
I have run into an issue when training PyTorch models under Slurm with multiple GPUs on a single node:
On my local PC with Slurm, the non-zero-rank process hangs at the first barrier() call and never gets past it. If I use a single process instead, the program runs to completion.
Interestingly, on a real HPC cluster with Slurm, no process gets past the first barrier(): rank 0 and the non-zero ranks all hang there. This happens even when I use only one process.
The same happens with CPU mode.
Could anyone help with this? Thanks a lot!
The following is the minimal code to reproduce the issue:
test.py
import os

import hostlist
import torch
import torch.distributed as dist

# Module-level state filled in by set_gpu_mode().
use_gpu = False
gpu_id = 0
device = None
distributed = False
dist_rank = 0
world_size = 1


def set_gpu_mode(mode):
    global use_gpu
    global device
    global gpu_id
    global distributed
    global dist_rank
    global world_size
    # Slurm exports the per-task indices and the task count.
    gpu_id = int(os.environ.get("SLURM_LOCALID", 0))
    dist_rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NTASKS", 1))

    distributed = world_size > 1
    use_gpu = mode
    device = torch.device(f"cuda:{gpu_id}" if use_gpu else "cpu")
    torch.backends.cudnn.benchmark = True


def init_process(backend="nccl"):
    print(device, dist_rank, world_size)
    print(f"Starting process with rank {dist_rank}...", flush=True)

    # Check the same variable that is read below (was "SLURM_STEPS_GPUS").
    if "SLURM_STEP_GPUS" in os.environ:
        gpu_ids = os.environ["SLURM_STEP_GPUS"].split(",")
        os.environ["MASTER_PORT"] = str(12345 + min(int(g) for g in gpu_ids))
    else:
        os.environ["MASTER_PORT"] = str(12345)

    if "SLURM_JOB_NODELIST" in os.environ:
        hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
        print(hostnames)
        os.environ["MASTER_ADDR"] = hostnames[0]
    else:
        os.environ["MASTER_ADDR"] = "127.0.0.1"

    dist.init_process_group(
        backend,
        rank=dist_rank,
        world_size=world_size,
    )
    print(f"Process {dist_rank} is connected.", flush=True)
    dist.barrier()
    print('check', flush=True)
    if dist_rank == 0:
        print("All processes are connected.", flush=True)


set_gpu_mode(False)
init_process()
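To check what each Slurm task actually sees before any process group is created, the environment handling in test.py can be pulled into a small pure function. This is only a sketch: the names `SlurmDistEnv`, `read_slurm_env` and `pick_backend` are made up for illustration and are not part of PyTorch or Slurm. It reads `SLURM_STEP_GPUS` (the variable test.py indexes) and uses the same base port 12345. `pick_backend` reflects that NCCL only works with CUDA tensors, so a CPU-mode run would normally use the gloo backend, whereas test.py calls init_process() with its default of "nccl" even after set_gpu_mode(False).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SlurmDistEnv:
    """Hypothetical container for the values test.py reads from Slurm."""
    local_id: int     # SLURM_LOCALID: task index within the node
    rank: int         # SLURM_PROCID: global task index
    world_size: int   # SLURM_NTASKS: total number of tasks
    master_port: int  # rendezvous port derived as in test.py


def read_slurm_env(env: Mapping[str, str] = os.environ,
                   base_port: int = 12345) -> SlurmDistEnv:
    """Derive rank information from a Slurm-style environment mapping."""
    port = base_port
    step_gpus = env.get("SLURM_STEP_GPUS")
    if step_gpus:
        # Compare GPU ids numerically, not as strings.
        port += min(int(g) for g in step_gpus.split(","))
    return SlurmDistEnv(
        local_id=int(env.get("SLURM_LOCALID", 0)),
        rank=int(env.get("SLURM_PROCID", 0)),
        world_size=int(env.get("SLURM_NTASKS", 1)),
        master_port=port,
    )


def pick_backend(use_gpu: bool) -> str:
    """NCCL needs CUDA devices; gloo is the usual choice for CPU runs."""
    return "nccl" if use_gpu else "gloo"


if __name__ == "__main__":
    # Under srun, every task prints its own view of the job.
    print(read_slurm_env(), flush=True)
```

Running this file under `srun` in place of test.py should print one line per task, which makes it easy to see whether both tasks agree on the world size and port and received distinct ranks.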
Slurm batch script:
#!/bin/bash
#SBATCH --job-name=slurm_test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --time=48:00:00
# cleaning modules launched during interactive mode
. /home/cylu/anaconda3/etc/profile.d/conda.sh
conda activate segmenter
export NCCL_DEBUG=INFO
srun python test.py
echo 'Done'
output:
cuda:1 1 2
Starting process with rank 1...
cuda:0 0 2
Starting process with rank 0...
['cylu-pc']
Process 0 is connected.
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
['cylu-pc']
Process 1 is connected.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
cylu-pc:80965:80965 [0] NCCL INFO Bootstrap : Using [0]enp5s0:131.155.125.194<0>
cylu-pc:80965:80965 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cylu-pc:80965:80965 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cylu-pc:80965:80965 [0] NCCL INFO NET/Socket : Using [0]enp5s0:131.155.125.194<0>
cylu-pc:80965:80965 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
cylu-pc:80966:80966 [1] NCCL INFO Bootstrap : Using [0]enp5s0:131.155.125.194<0>
cylu-pc:80966:80966 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cylu-pc:80966:80966 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cylu-pc:80966:80966 [1] NCCL INFO NET/Socket : Using [0]enp5s0:131.155.125.194<0>
cylu-pc:80966:80966 [1] NCCL INFO Using network Socket
cylu-pc:80965:81006 [0] NCCL INFO Channel 00/02 : 0 1
cylu-pc:80965:81006 [0] NCCL INFO Channel 01/02 : 0 1
cylu-pc:80966:81007 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
cylu-pc:80966:81007 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
cylu-pc:80966:81007 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff00
cylu-pc:80965:81006 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
cylu-pc:80965:81006 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
cylu-pc:80965:81006 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
cylu-pc:80965:81006 [0] NCCL INFO Channel 00 : 0[9000] -> 1[42000] via P2P/IPC
cylu-pc:80966:81007 [1] NCCL INFO Channel 00 : 1[42000] -> 0[9000] via P2P/IPC
cylu-pc:80965:81006 [0] NCCL INFO Channel 01 : 0[9000] -> 1[42000] via P2P/IPC
cylu-pc:80966:81007 [1] NCCL INFO Channel 01 : 1[42000] -> 0[9000] via P2P/IPC
cylu-pc:80965:81006 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
cylu-pc:80965:81006 [0] NCCL INFO comm 0x7f6b30001060 rank 0 nranks 2 cudaDev 0 busId 9000 - Init COMPLETE
cylu-pc:80965:80965 [0] NCCL INFO Launch mode Parallel
cylu-pc:80966:81007 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
cylu-pc:80966:81007 [1] NCCL INFO comm 0x7fed78001060 rank 1 nranks 2 cudaDev 1 busId 42000 - Init COMPLETE
check
All processes are connected.
As you can see, 'check' is printed only once, by the rank-0 process. The other process never prints 'check'.
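The NCCL warning in the log says to specify device_ids in barrier() to force a particular device. A minimal sketch of that suggestion follows. It is not a confirmed fix for the hang. It assumes that each task can see both GPUs, so that `SLURM_LOCALID` is a valid CUDA index, and that `MASTER_ADDR` and `MASTER_PORT` are already set as in init_process(). The helper names `barrier_kwargs` and `run` are hypothetical. `device_ids` is only passed for the NCCL backend, since `dist.barrier()` rejects it for other backends.

```python
import os


def barrier_kwargs(backend: str, local_id: int) -> dict:
    """Extra keyword arguments for dist.barrier().

    device_ids is only valid with the NCCL backend, so other backends
    get an empty dict.
    """
    return {"device_ids": [local_id]} if backend == "nccl" else {}


def run(use_gpu: bool) -> None:
    # torch is imported lazily so barrier_kwargs() can be checked without it.
    import torch
    import torch.distributed as dist

    local_id = int(os.environ.get("SLURM_LOCALID", 0))
    rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NTASKS", 1))
    backend = "nccl" if use_gpu else "gloo"

    if use_gpu:
        # Bind this rank to its GPU before the first collective call.
        torch.cuda.set_device(local_id)

    # Assumes MASTER_ADDR and MASTER_PORT are already in the environment.
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    dist.barrier(**barrier_kwargs(backend, local_id))
    print(f"rank {rank} passed the barrier", flush=True)

    # Tear the process group down explicitly before the process exits.
    dist.destroy_process_group()
```

Printing the rank after the barrier, rather than a bare 'check', also makes it clear from the log which process got through.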