I tried to use DistributedDataParallel for multi-GPU training on a single node. My code works fine on Host-A, but when I run it on Host-B it gets stuck at torch.distributed.barrier() after torch.distributed.init_process_group, and the GPU utilization of both GPUs is 100%. Here is my code for initializing DistributedDataParallel:
```python
import os
import torch

# torchrun exports LOCAL_RANK for every worker process
args.local_rank = int(os.environ['LOCAL_RANK'])
print(f"Local Rank {args.local_rank} | Using distributed mode by DistributedDataParallel")
torch.distributed.init_process_group(backend='nccl',
                                     world_size=args.world_size,
                                     rank=args.local_rank)
print('before barrier')
torch.distributed.barrier()
print('after barrier')
```
On Host-B, I run the same command as on Host-A:

```shell
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 test_ddp.py
```
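For context on what that command does: torchrun sets the rendezvous variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) in each worker's environment. The helper below (`read_dist_env` is my own hypothetical name, not part of the script above) is a minimal sketch of how a script can read them, with defaults that simulate a single-process run when torchrun is absent:

```python
import os

def read_dist_env():
    """Read the rendezvous variables that torchrun exports for each worker.

    Defaults simulate single-process execution when the script is launched
    without torchrun.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": os.environ.get("MASTER_PORT", "29500"),
    }
```

Since torchrun sets all of these (with `--standalone` it also picks MASTER_ADDR/MASTER_PORT itself), `init_process_group(backend='nccl')` can normally take everything from the environment via the default `env://` initialization, without explicit `world_size`/`rank` arguments.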
It produces the following output; both processes are stuck at torch.distributed.barrier(), and the GPU utilization of both GPUs stays at 100%.
```text
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING]
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] *****************************************
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] *****************************************
Local Rank 0 | Using distributed mode by DistributedDataParallel
Local Rank 1 | Using distributed mode by DistributedDataParallel
before barrier
before barrier
```
But the same code works fine on Host-A. Is there a problem with my code, or is there a problem with Host-B?
The information for Host-A and Host-B is listed below:

Host-A: 2 RTX-4090 GPUs, driver version 535.129.03, CUDA version 12.2
Host-B: 8 RTX-3090 GPUs, driver version 545.23.08, CUDA version 12.3
On Host-B, however, several different versions of CUDA are installed under /usr/local, but CUDA-12 is not among them. Could this be the problem with Host-B?
The versions of PyTorch and Python are the same on both hosts.