I’m stalling on any kind of barrier between two GPUs in a minimal example with the NVIDIA PyTorch container nvcr.io/nvidia/pytorch:25.03-py3, and nothing really jumps out at me (a layman) in the NCCL logs. I saw an older post where someone had driver problems; is that likely my case too? I initially used driver 570.86 and bumped up to 570.124. I’m running on a bare-metal Kubernetes cluster with Ubuntu 24.04, so I’m more or less restricted to the 570 driver and have to wait for 575 to be published in container form.
Edit: I should also mention that I originally started with nvcr.io/nvidia/pytorch:25.02-py3 and tried the bump to 25.03 to fix this. My normal training code always used to work fine, and I didn’t notice any relevant changes in the DDP training docs.
torchrun --nproc-per-node=gpu --standalone /mnt/user/test-ddp.py
import os
import torch
import torch.distributed as dist

if __name__ == "__main__":
    torch.cuda.set_device(f"cuda:{os.environ.get('LOCAL_RANK', 0)}")
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    print(f"Rank of the current process: {rank}")
    dist.barrier()
    print("All processes have reached the barrier.")
    dist.destroy_process_group()
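For what it’s worth, the ProcessGroupNCCL warning in the logs below suggests pinning the barrier to a device explicitly. My reading of that suggestion is something like the sketch below (passing device_id to init_process_group and device_ids to barrier); I’m not sure whether that should change anything here:

import os
import torch
import torch.distributed as dist

if __name__ == "__main__":
    # torchrun sets LOCAL_RANK for each process
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    # Bind the process group to this rank's GPU so NCCL doesn't have to guess the mapping
    dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))
    rank = dist.get_rank()
    print(f"Rank of the current process: {rank}")
    # Pass device_ids explicitly, as the warning recommends
    dist.barrier(device_ids=[local_rank])
    print("All processes have reached the barrier.")
    dist.destroy_process_group()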
W0528 12:31:53.174000 1 torch/distributed/run.py:763]
W0528 12:31:53.174000 1 torch/distributed/run.py:763] *****************************************
W0528 12:31:53.174000 1 torch/distributed/run.py:763] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0528 12:31:53.174000 1 torch/distributed/run.py:763] *****************************************
Rank of the current process: 0
Rank of the current process: 1
[rank1]:[W528 12:31:54.339872954 ProcessGroupNCCL.cpp:4782] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]:[W528 12:31:54.349033170 ProcessGroupNCCL.cpp:4782] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
ddp-deb-train-0:90:90 [0] NCCL INFO Bootstrap: Using eth0:192.168.19.109<0>
ddp-deb-train-0:90:90 [0] NCCL INFO cudaDriverVersion 12080
ddp-deb-train-0:90:90 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
ddp-deb-train-0:90:90 [0] NCCL INFO Comm config Blocking set to 1
ddp-deb-train-0:91:91 [1] NCCL INFO cudaDriverVersion 12080
ddp-deb-train-0:91:91 [1] NCCL INFO Bootstrap: Using eth0:192.168.19.109<0>
ddp-deb-train-0:91:91 [1] NCCL INFO NCCL version 2.25.1+cuda12.8
ddp-deb-train-0:91:91 [1] NCCL INFO Comm config Blocking set to 1
ddp-deb-train-0:90:105 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v9 (v9)
ddp-deb-train-0:90:105 [0] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v9)
ddp-deb-train-0:90:105 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ddp-deb-train-0:90:105 [0] NCCL INFO P2P plugin v9 IBext_v9
ddp-deb-train-0:90:105 [0] NCCL INFO NET/IB : No device found.
ddp-deb-train-0:90:105 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:192.168.19.109<0>
ddp-deb-train-0:90:105 [0] NCCL INFO NET/IB : No device found.
ddp-deb-train-0:90:105 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:192.168.19.109<0>
ddp-deb-train-0:90:105 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.19.109<0>
ddp-deb-train-0:90:105 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
ddp-deb-train-0:90:105 [0] NCCL INFO Using network Socket
ddp-deb-train-0:90:105 [0] NCCL INFO ncclCommInitRankConfig comm 0x25bb5550 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 81000 commId 0x1d0b7e730eb15adf - Init START
ddp-deb-train-0:91:106 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v9 (v9)
ddp-deb-train-0:91:106 [1] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v9)
ddp-deb-train-0:91:106 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ddp-deb-train-0:91:106 [1] NCCL INFO P2P plugin v9 IBext_v9
ddp-deb-train-0:91:106 [1] NCCL INFO NET/IB : No device found.
ddp-deb-train-0:91:106 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:192.168.19.109<0>
ddp-deb-train-0:91:106 [1] NCCL INFO NET/IB : No device found.
ddp-deb-train-0:91:106 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:192.168.19.109<0>
ddp-deb-train-0:91:106 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.19.109<0>
ddp-deb-train-0:91:106 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
ddp-deb-train-0:91:106 [1] NCCL INFO Using network Socket
ddp-deb-train-0:91:106 [1] NCCL INFO ncclCommInitRankConfig comm 0xaa2ec70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId e1000 commId 0x1d0b7e730eb15adf - Init START
ddp-deb-train-0:91:106 [1] NCCL INFO RAS client listening socket at ::1<28028>
ddp-deb-train-0:90:105 [0] NCCL INFO RAS client listening socket at ::1<28028>
ddp-deb-train-0:91:106 [1] NCCL INFO Bootstrap timings total 0.001023 (create 0.000037, send 0.000134, recv 0.000426, ring 0.000034, delay 0.000000)
ddp-deb-train-0:90:105 [0] NCCL INFO Bootstrap timings total 0.009177 (create 0.000035, send 0.000137, recv 0.008375, ring 0.000030, delay 0.000000)
ddp-deb-train-0:91:106 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
ddp-deb-train-0:90:105 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
ddp-deb-train-0:91:106 [1] NCCL INFO comm 0xaa2ec70 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
ddp-deb-train-0:90:105 [0] NCCL INFO comm 0x25bb5550 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
ddp-deb-train-0:91:106 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ddp-deb-train-0:90:105 [0] NCCL INFO Channel 00/04 : 0 1
ddp-deb-train-0:91:106 [1] NCCL INFO P2P Chunksize set to 131072
ddp-deb-train-0:90:105 [0] NCCL INFO Channel 01/04 : 0 1
ddp-deb-train-0:90:105 [0] NCCL INFO Channel 02/04 : 0 1
ddp-deb-train-0:90:105 [0] NCCL INFO Channel 03/04 : 0 1
ddp-deb-train-0:90:105 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ddp-deb-train-0:90:105 [0] NCCL INFO P2P Chunksize set to 131072
ddp-deb-train-0:90:105 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0
ddp-deb-train-0:90:110 [0] NCCL INFO [Proxy Service] Device 0 CPU core 237
ddp-deb-train-0:91:109 [1] NCCL INFO [Proxy Service] Device 1 CPU core 109
ddp-deb-train-0:91:111 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 219
ddp-deb-train-0:90:112 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 75
ddp-deb-train-0:91:106 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ddp-deb-train-0:91:106 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ddp-deb-train-0:90:105 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ddp-deb-train-0:90:105 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ddp-deb-train-0:90:105 [0] NCCL INFO CC Off, workFifoBytes 1048576
ddp-deb-train-0:90:105 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
ddp-deb-train-0:91:106 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
ddp-deb-train-0:90:105 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
ddp-deb-train-0:91:106 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
ddp-deb-train-0:90:105 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
ddp-deb-train-0:91:106 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
ddp-deb-train-0:90:105 [0] NCCL INFO ncclCommInitRankConfig comm 0x25bb5550 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 81000 commId 0x1d0b7e730eb15adf - Init COMPLETE
ddp-deb-train-0:91:106 [1] NCCL INFO ncclCommInitRankConfig comm 0xaa2ec70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId e1000 commId 0x1d0b7e730eb15adf - Init COMPLETE
ddp-deb-train-0:90:105 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.14 (kernels 0.11, alloc 0.00, bootstrap 0.01, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
ddp-deb-train-0:91:106 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.13 (kernels 0.12, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
ddp-deb-train-0:90:113 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
ddp-deb-train-0:90:113 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
ddp-deb-train-0:90:113 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
ddp-deb-train-0:90:113 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
ddp-deb-train-0:91:114 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
ddp-deb-train-0:91:114 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
ddp-deb-train-0:91:114 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
ddp-deb-train-0:91:114 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
ddp-deb-train-0:91:114 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
ddp-deb-train-0:90:113 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
root@ddp-deb-train-0:/workspace# nvidia-smi topo -m
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    SYS     SYS     64-127,192-255  1               N/A
GPU1    NODE     X      SYS     SYS     64-127,192-255  1               N/A
NIC0    SYS     SYS      X      PIX
NIC1    SYS     SYS     PIX      X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
root@ddp-deb-train-0:/workspace# nvidia-smi
Wed May 28 12:37:59 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 5000 Ada Gene... On | 00000000:81:00.0 Off | Off |
| 30% 40C P2 73W / 250W | 544MiB / 32760MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX 5000 Ada Gene... On | 00000000:E1:00.0 Off | Off |
| 30% 39C P2 75W / 250W | 544MiB / 32760MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 90 C /usr/bin/python 534MiB |
| 1 N/A N/A 91 C /usr/bin/python 534MiB |
+-----------------------------------------------------------------------------------------+