Distributed Training Only Works When InfiniBand Is Disabled

We are trying to run a training script in distributed form through Singularity and Slurm. Here’s a generic version of the script that launches the training:

srun singularity run --nv \
      -B/files \
      image.sif \
      torchrun \
        --nnodes 3 \
        --nproc_per_node 8 \
        --rdzv_id $RANDOM \
        --rdzv_backend c10d \
        --rdzv_endpoint $head_node_ip:29500 \
        training.py

And here’s how we submit the script above to Slurm:

$ sbatch --nodelist=list_of_nodes_to_use train.slurm

It’s pretty run-of-the-mill stuff for distributed training, I assume, but we are running into the following problem: the scripts above only work when we disable InfiniBand via export NCCL_IB_DISABLE=1.
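
For context, the workaround lives in the Slurm batch script itself. A simplified sketch of our train.slurm (the SBATCH directives here are placeholders rather than our exact settings):

#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Workaround: the job only completes when NCCL's InfiniBand transport is disabled
export NCCL_IB_DISABLE=1

# ... followed by the srun / singularity / torchrun command shown above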

When that’s not done, we get the following error:

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)

I assume there’s something missing (a library or specific configuration, perhaps) that will allow NCCL to work properly, but I am not sure what. Can anyone shed some light on this issue?

The image (.sif) we are using has Ubuntu 18.04.6 LTS and torch 1.13.1.

Could you rerun your script with NCCL_DEBUG=INFO and post the logs here?
Also, is this error reproducible in the latest PyTorch release?
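
If it helps, assuming your setup passes the host environment into the container (or using Singularity's SINGULARITYENV_ prefix to be explicit), setting it could look like:

# In the Slurm script, before the srun line
export NCCL_DEBUG=INFO
# or, to push the variable into the container explicitly:
export SINGULARITYENV_NCCL_DEBUG=INFO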

Thanks for the reply.

Here’s what I got running the script with NCCL_DEBUG=INFO.

Last error:
Net : Connect to <IP> failed : No route to hostNCCL error in: /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.

There are also some other messages that might be interesting further up in the log. This one:

misc/socket.cc:456 NCCL WARN Net : Connect to <IP> failed : No route to host

And this one:

NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

I will try adding that plugin to the image and get back to you as soon as possible, but I believe we have already tried that solution and got the same error.

As for whether it is reproducible in the latest PyTorch version, I am not sure. We don’t have any images with that version, but I can try it and get back to you as well.

That last message is just informational, not an error. In the absence of any network plugins, NCCL should fall back to its internal IB or socket implementations. Given this is running on an IB cluster, NET/IB should be kicking in. NCCL_DEBUG=TRACE might be a bit more verbose and point to the issue.
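
For example, again assuming the environment reaches the container; NCCL_DEBUG_SUBSYS is optional but can narrow the output to the init and network paths:

export NCCL_DEBUG=TRACE
# optional: limit the trace to the subsystems most relevant here
export NCCL_DEBUG_SUBSYS=INIT,NET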

This is what I got with NCCL_DEBUG=TRACE. There are loads of these errors in the log. I am assuming one for each GPU.

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 6 failed (Connect)

These also seem noteworthy, even if they might not be errors per se:

proxy.cc:1119 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 9, retcode 2

misc/ibvwrap.cc:311 NCCL WARN Call to ibv_create_qp failed with error Operation not supported

I think this is the core issue. It is likely your container does not have the right permissions set up to talk to the IB device’s command interface via syscalls. I am not too familiar with Singularity’s internals for IB devices, but you need the /dev/infiniband/uverbsX devices mapped into the container for the IB verbs API to send commands down to the hardware.

This might help: Bind Paths and Mounts — Singularity container 2.5.1 documentation

Make sure /dev is available in the container and is writable. Perhaps you have an admin-level Singularity config that excludes that binding?
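
As a quick sanity check (the exact device names will depend on the host), something like this should show whether the verbs devices are visible from inside the image:

# List the IB verbs devices as seen from inside the container
singularity exec --nv -B /dev image.sif ls -l /dev/infiniband

# You would expect entries like uverbs0, uverbs1, ... and rdma_cm;
# if the directory is missing or empty, the verbs API has nothing to talk to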

Ok, so we got this to work and here’s how we did it.

It turns out the version of the MLNX_OFED package in our image was different from the one installed on the host. We updated MLNX_OFED in the image to match the host’s version.
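
For anyone running into the same thing, this is roughly how the mismatch can be spotted (ofed_info ships with MLNX_OFED; treat the image name as a placeholder):

# MLNX_OFED version on the host
ofed_info -s

# MLNX_OFED version inside the image
singularity exec image.sif ofed_info -s

# The two should report the same release; in our case they did not
# until we rebuilt the image with the matching MLNX_OFED package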

We also had to adjust these two environment variables:

export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_HCA=mlx5_0:1

We set their values in the Slurm script.
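
Putting it together, the relevant part of our Slurm script now looks roughly like this (the interface and HCA names are specific to our nodes, so treat them as examples):

# Tell NCCL which IP interface to use for bootstrap/socket traffic
export NCCL_SOCKET_IFNAME=bond0
# Restrict NCCL to port 1 of the mlx5_0 HCA
export NCCL_IB_HCA=mlx5_0:1
# NCCL_IB_DISABLE is no longer set, so the InfiniBand transport is used

srun singularity run --nv \
      -B/files \
      image.sif \
      torchrun \
        --nnodes 3 \
        --nproc_per_node 8 \
        --rdzv_id $RANDOM \
        --rdzv_backend c10d \
        --rdzv_endpoint $head_node_ip:29500 \
        training.py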