Process stuck at dist.barrier() when using DDP, right after dist.init_process_group

I am using DistributedDataParallel for multi-GPU training on a single node. My code works fine on Host-A, but when I run it on Host-B it gets stuck at torch.distributed.barrier(), which is called right after torch.distributed.init_process_group(), and the utilization of both GPUs stays at 100%.

Here is how I initialize the distributed process group for DistributedDataParallel:

import os
import torch.distributed

# args comes from argparse elsewhere in the script
args.local_rank = int(os.environ['LOCAL_RANK'])
print(f"Local Rank {args.local_rank} | Using distributed mode by DistributedDataParallel")

torch.distributed.init_process_group(backend='nccl',
                                     world_size=args.world_size, rank=args.local_rank)

print('before barrier')
torch.distributed.barrier()
print('after barrier')
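One detail I am unsure about: I never call torch.cuda.set_device() before the barrier, so both processes may end up using GPU 0. A variant I am considering (just a sketch under torchrun, not verified to change anything) would pin each process to its local GPU and let init_process_group read the rank and world size that torchrun exports:

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)        # pin this process to GPU `local_rank` before any NCCL call

dist.init_process_group(backend='nccl')  # torchrun exports RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
dist.barrier()
print(f'rank {dist.get_rank()} passed the barrier')
dist.destroy_process_group()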

On Host-B, I run the following command, which is the same one I use on Host-A:

CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 test_ddp.py

It produces the following output: both processes are stuck at torch.distributed.barrier(), and the utilization of both GPUs is 100%.

[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] 
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] *****************************************
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] *****************************************
Local Rank 0 | Using distributed mode by DistributedDataParallel
Local Rank 1 | Using distributed mode by DistributedDataParallel
before barrier
before barrier
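To narrow down where the hang happens, I will try rerunning with NCCL's debug logging enabled. Setting the standard NCCL environment variables from Python before init_process_group (they can just as well be exported in the shell) would look roughly like this:

import os

# Standard NCCL environment variables; they only add logging and do not
# change behavior. They must be set before the process group is created.
os.environ['NCCL_DEBUG'] = 'INFO'
os.environ['NCCL_DEBUG_SUBSYS'] = 'INIT,NET'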

But the same code works fine on Host-A. Is there a problem with my code, or is there a problem with Host-B?

The information about Host-A and Host-B is listed below:

Host-A: installed with 2 RTX 4090 GPUs; the driver version is 535.129.03 and the CUDA version is 12.2.

Host-B: installed with 8 RTX 3090 GPUs; the driver version is 545.23.08 and the CUDA version is 12.3.
However, several different CUDA toolkit versions are installed under /usr/local on Host-B, and CUDA 12 is not among them. Could this be the problem with Host-B?
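I am not sure the toolkits under /usr/local even matter here, since as far as I understand the pip/conda builds of PyTorch bundle their own CUDA runtime. A quick check of what each host's PyTorch actually reports would be something like this:

import torch

# Report the CUDA / NCCL versions this PyTorch build was compiled against,
# independent of whatever toolkits are installed under /usr/local.
print('torch version  :', torch.__version__)
print('built with CUDA:', torch.version.cuda)
print('cuda available :', torch.cuda.is_available())
print('NCCL version   :', torch.cuda.nccl.version())
print('visible GPUs   :', torch.cuda.device_count())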


The PyTorch and Python versions are the same on both hosts.