PyTorch, CUDA, and NCCL

I’d like to upgrade NCCL on my system to 2.10.3, which supports bfloat16, a type I’d like to use. I don’t know the dependency relationships among PyTorch, CUDA, and NCCL. Does PyTorch bind to particular versions of NCCL, as suggested by this issue? Can I choose a newer version of NCCL without upgrading either PyTorch or CUDA?

The PyTorch binaries ship with a statically linked NCCL via the NCCL submodule. The current CUDA 11.3 nightly binary already uses NCCL 2.10.3, so you could use that.
On the other hand, if you want a specific NCCL version that isn’t shipped in a binary release, you could build PyTorch from source against your locally installed NCCL via:

NCCL_INCLUDE_DIR="/usr/include/" \
    NCCL_LIB_DIR="/usr/lib/" \
    USE_SYSTEM_NCCL=1 \
    python setup.py install
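Once a build is installed, you can confirm at runtime which NCCL version it was compiled against. A minimal sketch, assuming a CUDA-enabled PyTorch build; the `nccl_supports_bf16` helper is illustrative and not part of the PyTorch API:

```python
# Sketch: check whether the NCCL bundled with an installed PyTorch build
# is new enough for bfloat16, which first appeared in NCCL 2.10.3.

def nccl_supports_bf16(version):
    """Return True if the (major, minor, patch) tuple is >= (2, 10, 3).

    Illustrative helper, not a PyTorch API.
    """
    return tuple(version) >= (2, 10, 3)

if __name__ == "__main__":
    import torch  # assumes a CUDA-enabled PyTorch build

    version = torch.cuda.nccl.version()  # e.g. (2, 10, 3) on recent builds
    print(f"NCCL {'.'.join(map(str, version))}, "
          f"bfloat16 supported: {nccl_supports_bf16(version)}")
```

If the reported version is older than 2.10.3, bfloat16 collectives won’t be available regardless of the CUDA toolkit version.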

Thanks. Do you know when the current CUDA 11.3 nightly will become official?

We are currently targeting PyTorch 1.10.0 as the stable release using the CUDA 11.3 runtime.


Thank you, ptrblck.