An error in ibv (i.e., InfiniBand verbs) indicates a problem with GPU Direct, which NCCL tries to use for RDMA but Gloo doesn't. You can try to confirm that this is indeed the issue by running with the NCCL_IB_DISABLE=1 environment variable set. That may get things working, but it would probably be slower, since NCCL then communicates without InfiniBand. If the error goes away with that setting, you might want to follow the instructions here to troubleshoot the InfiniBand issue itself: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-direct
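One way to run that check from Python is sketched below. This is only a rough sketch: the helper names `nccl_debug_env` and `run_training` are made up for illustration, and `train.py` is a placeholder for whatever script you actually launch. It also sets NCCL_DEBUG=INFO, which is not required for the test but makes NCCL log which transport it picked.

```python
import os
import subprocess
import sys


def nccl_debug_env(disable_ib=True):
    """Return a copy of the current environment with NCCL diagnostics set.

    Hypothetical helper, not part of PyTorch or NCCL.
    """
    env = dict(os.environ)
    # NCCL_IB_DISABLE=1 tells NCCL not to use the InfiniBand transport.
    env["NCCL_IB_DISABLE"] = "1" if disable_ib else "0"
    # NCCL_DEBUG=INFO makes NCCL print which transport it selected.
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_training(script="train.py", disable_ib=True):
    # "train.py" is a placeholder for your own training script.
    return subprocess.run([sys.executable, script], env=nccl_debug_env(disable_ib))
```

Calling `run_training()` once with the default and once with `disable_ib=False` lets you compare the two runs: if the ibv error only shows up in the second one, InfiniBand / GPU Direct is the likely culprit.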