P2P disable not working

Hey Folks,

The machine I'm using has 8 GPUs, but they don't support P2P. I'd like to run a job with P2P disabled, so I set the NCCL_P2P_DISABLE environment variable to 1. However, PyTorch/NCCL doesn't seem to honor the setting and throws a "peer access is not supported" error. As far as I can tell, the code does not explicitly unset this environment variable. What could be the possible reasons for this behavior?

Steps to reproduce bug

  1. Pull image: nvcr.io/nvidia/nemo-rl:v0.4.0

  2. Run docker: docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo-rl:v0.4.0 /bin/bash

  3. Execute the following command:

uv run python examples/run_dpo.py \
  policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
  policy.train_global_batch_size=256 \
  cluster.gpus_per_node=8
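
For reference, I set the variable before launching the script, e.g. exported in the container shell (passing it into the container through docker run with -e should be equivalent):

export NCCL_P2P_DISABLE=1
# equivalently, set it when starting the container:
docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e NCCL_P2P_DISABLE=1 nvcr.io/nvidia/nemo-rl:v0.4.0 /bin/bash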

Additional Information
Machine type: AWS EC2 g6.48xlarge instance.

(base) [ec2-user@ip-X-X-X-X ~]$ nvidia-smi topo -p2p r
 	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	
 GPU0	X	NS	NS	NS	NS	NS	NS	NS	
 GPU1	NS	X	NS	NS	NS	NS	NS	NS	
 GPU2	NS	NS	X	NS	NS	NS	NS	NS	
 GPU3	NS	NS	NS	X	NS	NS	NS	NS	
 GPU4	NS	NS	NS	NS	X	NS	NS	NS	
 GPU5	NS	NS	NS	NS	NS	X	NS	NS	
 GPU6	NS	NS	NS	NS	NS	NS	X	NS	
 GPU7	NS	NS	NS	NS	NS	NS	NS	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

Stack Trace:

(DTensorPolicyWorker pid=123067) Initializing DTensorPolicyWorker with is_vlm=False [repeated 7x across cluster]
(DTensorPolicyWorker pid=122863) [Rank 0] Loading model meta-llama/Llama-3.1-8B-Instruct on CPU... [repeated 8x across cluster]
(DTensorPolicyWorker pid=122863) [Rank 0] Initializing empty model for FSDP... [repeated 7x across cluster]
(DTensorPolicyWorker pid=123067) [Rank 7] Loading state dict from rank 0... [repeated 6x across cluster]
(DTensorPolicyWorker pid=122863) /opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:859: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`.
(DTensorPolicyWorker pid=122863)   warnings.warn(
Traceback (most recent call last):
  File "/opt/nemo-rl/examples/run_dpo.py", line 294, in <module>
    main()
  File "/opt/nemo-rl/examples/run_dpo.py", line 278, in main
    ) = setup(config, tokenizer, train_dataset, val_dataset)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/nemo_rl/algorithms/dpo.py", line 252, in setup
    policy.print_node_ip_and_gpu_id()
  File "/opt/nemo-rl/nemo_rl/models/policy/lm_policy.py", line 784, in print_node_ip_and_gpu_id
    results = ray.get(
              ^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 932, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::lm_policy-0-7:DTensorPolicyWorker.__init__() (pid=123067, ip=172.17.0.2, actor_id=3cd94e421f161668c11abd2901000000, repr=DTensorPolicyWorker[rank=7])
  File "/opt/nemo-rl/nemo_rl/models/policy/dtensor_policy_worker.py", line 356, in __init__
    set_model_state_dict(
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict.py", line 1277, in set_model_state_dict
    return _load_model_state_dict(model, model_state_dict, info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict.py", line 590, in _load_model_state_dict
    _broadcast_state_dict(
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/_state_dict_utils.py", line 614, in _broadcast_state_dict
    dist.broadcast_object_list(broadcast_list, src=0, group=pg)
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3483, in broadcast_object_list
    broadcast(object_sizes_tensor, src=global_src, group=group)
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2714, in broadcast
    work = group.broadcast([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'

Did you check the stack trace to see which part of the code raises this error? If so, could you post it here?

Updated the post with the stack trace. It appears to happen when loading/broadcasting the model from rank 0 to the workers on the other ranks.

Thank you! Did you export the env variable, or are you trying to set it inside the script? (If the latter, try exporting it instead.)

Also, did NCCL_DEBUG=INFO yield any additional information about the failure?
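
For example, something like this in the environment before launching (just a sketch; the NCCL_DEBUG_SUBSYS filter is optional):

export NCCL_DEBUG=INFO
# optionally limit the output to the init and network subsystems
export NCCL_DEBUG_SUBSYS=INIT,NET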

I see the same issue when setting the NCCL_P2P_DISABLE variable both ways.

Okay, setting NCCL_DEBUG=INFO revealed an issue with the aws-ofi-nccl plugin initialization:

(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Using Libfabric version 1.22
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Using CUDA driver version 12090 with runtime 12090
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Configuring AWS-specific options
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Setting provider_filter to efa
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/OFI Internode latency set at 75.0 us
(DTensorPolicyWorker pid=23138)
(DTensorPolicyWorker pid=23138) [2025-12-29 06:19:56] b09fbab8a316:23138:24256 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: No data available
(DTensorPolicyWorker pid=23138)
(DTensorPolicyWorker pid=23138) [2025-12-29 06:19:56] b09fbab8a316:23138:24256 [0] nccl_net_ofi_create_plugin:262 NCCL WARN NET/OFI Unable to find a protocol that worked.  Failing initialization.
(DTensorPolicyWorker pid=23138)
(DTensorPolicyWorker pid=23138) [2025-12-29 06:19:56] b09fbab8a316:23138:24256 [0] nccl_net_ofi_create_plugin:335 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
(DTensorPolicyWorker pid=23138)
(DTensorPolicyWorker pid=23138) [2025-12-29 06:19:56] b09fbab8a316:23138:24256 [0] nccl_net_ofi_init:155 NCCL WARN NET/OFI Initializing plugin failed
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO plugin/net/net_v9.cc:57 -> 3
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/IB : No device found.
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO Using network Socket
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO ncclCommInitRankConfig comm 0x2a6e20f0 rank 6 nranks 8 cudaDev 0 nvmlDev 6 busId b2000 commId 0xac409ebab2fc9c4 - Init START
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO Bootstrap timings total 0.002235 (create 0.000078, send 0.000105, recv 0.001098, ring 0.000174, delay 0.000000)
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 1.
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 1
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO NVLS multicast support is not available on dev 0
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO comm 0x2a6e20f0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24256 [0] NCCL INFO P2P Chunksize set to 131072
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24279 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 188
(DTensorPolicyWorker pid=23138) b09fbab8a316:23138:24270 [0] NCCL INFO [Proxy Service] Device 0 CPU core 80

I’m looking into resolving this, but if you have any suggestions, please share.

Since I’m only trying to get the job running on a single machine, I set FI_PROVIDER=sockets instead of efa, but I still see the same "peer access is not supported" error.
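
For reference, roughly what I'm setting for this attempt (a sketch):

export NCCL_P2P_DISABLE=1
export FI_PROVIDER=sockets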

Update:

After setting FI_PROVIDER=sockets and NCCL_SOCKET_IFNAME, the logs show NCCL INFO Channel 00: 0[0] -> 1[1] via SHM/direct/direct right before the peer-access-not-supported error.

So, I tried setting NCCL_SHM_DISABLE to 1 to force the communication through sockets. This worked (a sketch of the environment I ended up with follows the questions below), but I have a couple of questions:

  1. What could cause a P2P error when the GPUs try to communicate through shared memory?
  2. Is communication through sockets faster or slower than shared memory? If anyone has compared the two, please share your experience.
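
For reference, the combination of variables that got the job running for me looks roughly like this (a sketch of my environment, not a recommendation; eth0 is simply the interface shown in the NCCL logs above):

export NCCL_P2P_DISABLE=1      # GPUs on this instance don't support P2P
export NCCL_SHM_DISABLE=1      # work around the SHM peer-access failure above
export FI_PROVIDER=sockets     # single node, not relying on EFA
export NCCL_SOCKET_IFNAME=eth0 # adjust if your interface differs
export NCCL_DEBUG=INFO         # optional, kept for troubleshooting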