Process hangs at dist.barrier() when using DDP, right after dist.init_process_group

I am trying to use DistributedDataParallel for multi-GPU training on a single node. My code works fine on Host-A, but on Host-B it hangs at torch.distributed.barrier() right after torch.distributed.init_process_group, and GPU-Util is 100% on both GPUs.

Here is my code for initializing DistributedDataParallel:

args.local_rank = int(os.environ['LOCAL_RANK'])

print(f"Local Rank {args.local_rank} | Using distributed mode by DistributedDataParallel")

torch.distributed.init_process_group(backend='nccl',
                                     world_size=args.world_size, rank=args.local_rank)

print('before barrier')

torch.distributed.barrier()

print('after barrier')

On Host-B, I run the following command, which is the same one I use on Host-A:

CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 test_ddp.py

It produces the following output. Both processes hang at torch.distributed.barrier(), and GPU-Util is 100% on both GPUs.

[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] 
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] *****************************************
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-03-09 19:37:32,243] torch.distributed.run: [WARNING] *****************************************
Local Rank 0 | Using distributed mode by DistributedDataParallel
Local Rank 1 | Using distributed mode by DistributedDataParallel
before barrier
before barrier

The same code works fine on Host-A. Is there a problem with my code, or is there a problem with Host-B?

Details of Host-A and Host-B are listed below:

Host-A: 2 RTX 4090 GPUs, driver version 535.129.03, CUDA version 12.2

Host-B: 8 RTX 3090 GPUs, driver version 545.23.08, CUDA version 12.3
On Host-B, several different CUDA versions are installed under /usr/local, but CUDA 12 is not among them. Could this be the problem with Host-B?


The PyTorch and Python versions are the same on both hosts.
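One way to compare the two hosts beyond the driver version is to print what PyTorch itself reports on each of them. The sketch below is only an illustration: `describe_environment` is a hypothetical helper name, and it assumes a standard PyTorch build. Prebuilt PyTorch binaries normally ship their own CUDA runtime and NCCL, so the toolkits under /usr/local are usually not what the process loads, and the values printed here are the more relevant ones to compare.

```python
import importlib.util
import platform


def describe_environment():
    """Collect the versions PyTorch reports (hypothetical helper for comparing hosts)."""
    info = {"python": platform.python_version()}
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info

    import torch

    info["torch"] = torch.__version__
    # CUDA version the PyTorch build was compiled against (None on CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    info["gpu_count"] = torch.cuda.device_count()
    if torch.cuda.is_available():
        # NCCL version bundled with this PyTorch build.
        info["nccl"] = torch.cuda.nccl.version()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it on Host-A and Host-B and diffing the output shows whether the two installs really match.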


I have the same issue on a CUDA 12.2 machine with RTX A6000 GPUs. The code below reproduces it when launched with:

torchrun --nnodes=1 --nproc_per_node=gpu ddp_test.py --distributed

import argparse  
import os
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--distributed", action="store_true")
args = parser.parse_args()
args.local_rank = int(os.environ['LOCAL_RANK'])

print(f"Local Rank {args.local_rank} | Using distributed mode by DistributedDataParallel")

dist.init_process_group(backend='nccl')
print(f'before barrier. Rank: {args.local_rank} {dist.get_rank()}. World size {dist.get_world_size()}')
dist.barrier()
print(f'after barrier. Rank: {args.local_rank}')

Logs show that all the ranks reach the barrier and then get stuck.

[2025-01-24 15:35:42,293] torch.distributed.run: [WARNING] 
[2025-01-24 15:35:42,293] torch.distributed.run: [WARNING] *****************************************
[2025-01-24 15:35:42,293] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2025-01-24 15:35:42,293] torch.distributed.run: [WARNING] *****************************************
Local Rank 1 | Using distributed mode by DistributedDataParallel
Local Rank 2 | Using distributed mode by DistributedDataParallel
Local Rank 0 | Using distributed mode by DistributedDataParallel
Local Rank 3 | Using distributed mode by DistributedDataParallel
before barrier. Rank: 1 1. World size 4
before barrier. Rank: 3 3. World size 4
before barrier. Rank: 2 2. World size 4
before barrier. Rank: 0 0. World size 4
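Since every rank reaches the barrier, one way to narrow a hang like this down is to bind each process to its own GPU and turn on NCCL's logging before the process group is created. The following is a minimal sketch, not a confirmed fix: `local_rank_from_env` is a hypothetical helper, and pinning the device with `torch.cuda.set_device` plus passing `device_ids` to the barrier is an assumption about what might help here.

```python
import os


def local_rank_from_env(env=os.environ):
    """torchrun exports LOCAL_RANK for each worker; fall back to 0 outside torchrun."""
    return int(env.get("LOCAL_RANK", "0"))


def main():
    import torch
    import torch.distributed as dist

    # Ask NCCL to log its setup, including which transport it picks between GPUs.
    os.environ.setdefault("NCCL_DEBUG", "INFO")

    local_rank = local_rank_from_env()
    # Bind this process to its own GPU before any NCCL communicator is created.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl")
    print(f"before barrier. Rank: {dist.get_rank()}", flush=True)
    dist.barrier(device_ids=[local_rank])
    print(f"after barrier. Rank: {dist.get_rank()}", flush=True)
    dist.destroy_process_group()


# Only run the distributed part when launched by torchrun (which sets LOCAL_RANK).
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

It is launched the same way as the script above; the NCCL INFO lines printed before the hang indicate which transport was selected between the GPUs.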

The issue turned out to be the NVLink setup between the GPUs on the machine I was using. I had to disable NVLink usage and go over PCIe instead. The following exports resolved it for me:

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=lo
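If changing the shell environment is inconvenient, the same three settings can also be applied from inside the training script, provided this happens before init_process_group is called. This is a sketch under that assumption; `apply_nccl_workaround` and `NCCL_WORKAROUND` are hypothetical names, and only the variable names and values come from the post above.

```python
import os

# The three settings quoted in the post above.
NCCL_WORKAROUND = {
    "NCCL_P2P_DISABLE": "1",     # disable direct GPU-to-GPU (P2P) transfers
    "NCCL_IB_DISABLE": "1",      # disable the InfiniBand transport
    "NCCL_SOCKET_IFNAME": "lo",  # restrict NCCL sockets to the loopback interface
}


def apply_nccl_workaround(env=os.environ):
    """Set the variables unless they are already defined, and return the effective values."""
    for name, value in NCCL_WORKAROUND.items():
        env.setdefault(name, value)
    return {name: env[name] for name in NCCL_WORKAROUND}


if __name__ == "__main__":
    # Call this before torch.distributed.init_process_group(backend="nccl").
    print(apply_nccl_workaround())
```

Using `setdefault` keeps any value already exported in the shell, so the script does not silently override a deliberate setting. Note that restricting NCCL to the loopback interface only makes sense for single-node runs like the ones in this thread.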