Hi, I have a problem using torch.distributed in a CUDA 11.4 environment (NVIDIA RTX A6000).
It seems that torch.distributed and DDP don't work properly. I tested with the simple code below.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
import logging
import os


def get_local_rank() -> int:
    # LOCAL_RANK is set by the launcher for each worker process.
    return int(os.environ.get('LOCAL_RANK', 0))


def init_distributed() -> bool:
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    distributed = world_size > 1
    if distributed:
        backend = 'nccl' if torch.cuda.is_available() else 'gloo'
        dist.init_process_group(backend=backend, init_method='env://')
        if backend == 'nccl':
            print('set nccl correctly!')
            # Bind this process to its own GPU.
            torch.cuda.set_device(get_local_rank())
        else:
            logging.warning('Running on CPU only!')
        assert torch.distributed.is_initialized()
    return distributed


dist_stat = init_distributed()
local_rank = get_local_rank()
print('hi', local_rank)
dist.barrier(device_ids=[get_local_rank()])
print('bye', local_rank)
I ran the above code with the commands below:
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -u -m torch.distributed.run --nnodes=1 --nproc_per_node=gpu test_dist.py
Then I got the following output:
``
set nccl correctly!
set nccl correctly!set nccl correctly!
set nccl correctly!
hi 0
hi 1
hi 2
hi 3
``
The script then hangs at dist.barrier() forever. However, when I use only two GPUs, everything works fine. What should I look for in order to solve the problem?
The Python version is 3.9.7 and torch is 1.10.1+cu113.
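For anyone trying to narrow this down, here is a minimal diagnostic sketch. It assumes the hang happens inside NCCL's GPU-to-GPU communication, which is only a guess at this point. The helper name `enable_nccl_diagnostics` is hypothetical and not part of the script above; `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are NCCL environment variables, and they need to be set before the process group is created.

```python
import os


def enable_nccl_diagnostics(disable_p2p: bool = False) -> dict:
    """Hypothetical helper: set NCCL environment variables for debugging.

    Must run before dist.init_process_group(...) so NCCL sees the values.
    """
    # Ask NCCL to log its initialization and the transports it picks.
    settings = {"NCCL_DEBUG": "INFO"}
    if disable_p2p:
        # Assumption under test: direct GPU peer-to-peer transfers are
        # what hangs, so turn them off for this run.
        settings["NCCL_P2P_DISABLE"] = "1"
    os.environ.update(settings)
    return settings


# Place this at the top of test_dist.py, before init_distributed() is called.
enable_nccl_diagnostics(disable_p2p=True)
```

If the barrier completes with peer-to-peer disabled, that would suggest looking at how the GPUs talk to each other on this machine rather than at the script itself. If it still hangs, the `NCCL_DEBUG=INFO` log from each rank should at least show how far initialization got.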