Hey folks,

The machine I'm working on has 8 GPUs, but they don't support P2P, so I'd like to run the job with P2P disabled. To do that, I set the NCCL_P2P_DISABLE environment variable to 1. However, PyTorch/NCCL doesn't honor the setting and still fails with a "peer access is not supported" error. As far as I can tell, the code never explicitly unsets this environment variable. What could be the possible reasons for this behavior?
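One thing I'm wondering (my guess, not something I've confirmed): the policy workers run as Ray actors in their own virtualenvs (note the /opt/ray_venvs paths in the trace below), so a variable exported in my shell might never reach the actor processes. A minimal sketch of how I've been checking whether the variable actually propagates, using Ray's runtime_env to force it into every worker:

```python
# Minimal sketch: verify that NCCL_P2P_DISABLE actually reaches Ray workers.
# runtime_env.env_vars pushes the variable into every actor/task process,
# regardless of what the driver shell exported.
import os
import ray

ray.init(runtime_env={"env_vars": {"NCCL_P2P_DISABLE": "1"}})

@ray.remote
def check_env():
    # Should return "1" if the variable made it into the worker process.
    return os.environ.get("NCCL_P2P_DISABLE")

print("NCCL_P2P_DISABLE in worker:", ray.get(check_env.remote()))
```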
Steps to reproduce bug

- Pull the image: nvcr.io/nvidia/nemo-rl:v0.4.0
- Run the container:

  ```
  docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo-rl:v0.4.0 /bin/bash
  ```

- Execute the following command:

  ```
  uv run python examples/run_dpo.py \
      policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
      policy.train_global_batch_size=256 \
      cluster.gpus_per_node=8
  ```
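For reference, here is a quick standalone check I ran (a minimal sketch, nothing NeMo-RL-specific) confirming that no GPU pair on this box supports peer access, consistent with the nvidia-smi topology under Additional Information:

```python
# Minimal sketch: ask CUDA directly which GPU pairs support peer access.
# On this machine every pair reports False, matching the all-"NS"
# nvidia-smi topology shown under Additional Information.
import torch

n = torch.cuda.device_count()
p2p_pairs = [
    (i, j)
    for i in range(n)
    for j in range(n)
    if i != j and torch.cuda.can_device_access_peer(i, j)
]
print(f"{n} GPUs visible, P2P-capable pairs: {p2p_pairs or 'none'}")
```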
Additional Information
Machine type: AWS EC2 g6.48xlarge instance.
```
(base) [ec2-user@ip-X-X-X-X ~]$ nvidia-smi topo -p2p r
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NS NS NS NS NS NS NS
GPU1 NS X NS NS NS NS NS NS
GPU2 NS NS X NS NS NS NS NS
GPU3 NS NS NS X NS NS NS NS
GPU4 NS NS NS NS X NS NS NS
GPU5 NS NS NS NS NS X NS NS
GPU6 NS NS NS NS NS NS X NS
GPU7 NS NS NS NS NS NS NS X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
```
Stack Trace:

```
(DTensorPolicyWorker pid=123067) Initializing DTensorPolicyWorker with is_vlm=False [repeated 7x across cluster]
(DTensorPolicyWorker pid=122863) [Rank 0] Loading model meta-llama/Llama-3.1-8B-Instruct on CPU... [repeated 8x across cluster]
(DTensorPolicyWorker pid=122863) [Rank 0] Initializing empty model for FSDP... [repeated 7x across cluster]
(DTensorPolicyWorker pid=123067) [Rank 7] Loading state dict from rank 0... [repeated 6x across cluster]
(DTensorPolicyWorker pid=122863) /opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:859: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`.
(DTensorPolicyWorker pid=122863) warnings.warn(
Traceback (most recent call last):
  File "/opt/nemo-rl/examples/run_dpo.py", line 294, in <module>
    main()
  File "/opt/nemo-rl/examples/run_dpo.py", line 278, in main
    ) = setup(config, tokenizer, train_dataset, val_dataset)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/nemo_rl/algorithms/dpo.py", line 252, in setup
    policy.print_node_ip_and_gpu_id()
  File "/opt/nemo-rl/nemo_rl/models/policy/lm_policy.py", line 784, in print_node_ip_and_gpu_id
    results = ray.get(
              ^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 932, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::lm_policy-0-7:DTensorPolicyWorker.__init__() (pid=123067, ip=172.17.0.2, actor_id=3cd94e421f161668c11abd2901000000, repr=DTensorPolicyWorker[rank=7])
  File "/opt/nemo-rl/nemo_rl/models/policy/dtensor_policy_worker.py", line 356, in __init__
    set_model_state_dict(
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict.py", line 1277, in set_model_state_dict
    return _load_model_state_dict(model, model_state_dict, info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict.py", line 590, in _load_model_state_dict
    _broadcast_state_dict(
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/_state_dict_utils.py", line 614, in _broadcast_state_dict
    dist.broadcast_object_list(broadcast_list, src=0, group=pg)
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3483, in broadcast_object_list
    broadcast(object_sizes_tensor, src=global_src, group=group)
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2714, in broadcast
    work = group.broadcast([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'
```
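For what it's worth, the failing call is a plain dist.broadcast, so it should be reproducible outside NeMo-RL/Ray. Here is a minimal standalone sketch (the script name and torchrun launch line are my own, not part of the repro above); if this runs clean with NCCL_P2P_DISABLE=1 while the NeMo-RL job still dies, that would suggest the variable never reaches the Ray workers:

```python
# check_broadcast.py -- hypothetical standalone repro of the broadcast above.
# Launch: NCCL_P2P_DISABLE=1 NCCL_DEBUG=INFO torchrun --nproc_per_node=8 check_broadcast.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    t = torch.ones(1, device="cuda")
    dist.broadcast(t, src=0)  # the same collective that dies in the trace
    print(
        f"[rank {dist.get_rank()}] "
        f"NCCL_P2P_DISABLE={os.environ.get('NCCL_P2P_DISABLE')} broadcast ok"
    )
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```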