After further investigation, the problem turned out to be the NCCL backend trying to use peer-to-peer (P2P) transport.
Setting the environment variable NCCL_P2P_DISABLE=1 fixed the issue.
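
For anyone hitting the same thing, here is a minimal sketch of one way to apply the setting from inside a Python training script. It assumes the variable is set before the NCCL backend is initialized. `disable_nccl_p2p` is a hypothetical helper name for illustration, not part of any library, and the `init_process_group` line is only a comment marking where initialization would go in a PyTorch script.

```python
import os


def disable_nccl_p2p(env=os.environ):
    """Hypothetical helper: set NCCL_P2P_DISABLE=1 in the given environment mapping."""
    # NCCL reads this variable when it sets up its communicators,
    # so it has to be in place before the backend is initialized.
    env["NCCL_P2P_DISABLE"] = "1"
    return env


disable_nccl_p2p()

# Initialize the distributed backend only after this point, e.g. in PyTorch:
# torch.distributed.init_process_group(backend="nccl")
```

Exporting the variable in the shell before launching the job should have the same effect. Disabling P2P is a workaround, so it may be worth checking throughput afterwards.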