How can I change the NCCL version in PyTorch?

hi, I’m using CUDA 11.3 and multi-GPU runs freeze, so I thought it might be solved if I could change torch.cuda.nccl.version()…

also, is there any way to find NCCL 2.10.3 in my environment? apt search nccl didn’t show the 2.10.3 version that torch.cuda.nccl.version() reports. I wonder whether, if I removed 2.10.3, torch would fall back to 2.9.9 as the default.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
>>> torch.cuda.nccl.version()
(2, 10, 3)
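For reference on what that tuple means: ncclGetVersion() returns a single integer (e.g. 21003 for 2.10.3), and torch.cuda.nccl.version() decodes it into (major, minor, patch). A rough Python sketch of that decoding (the threshold for the pre-2.9 encoding is an assumption based on NCCL’s documented version-code change in 2.9, not the actual PyTorch source):

```python
def decode_nccl_version(code):
    """Decode the integer from ncclGetVersion() into (major, minor, patch).

    NCCL >= 2.9 encodes the version as major*10000 + minor*100 + patch;
    earlier releases used major*1000 + minor*100 + patch.
    """
    if code < 2900:  # pre-2.9 encoding
        return (code // 1000, (code % 1000) // 100, code % 100)
    return (code // 10000, (code % 10000) // 100, code % 100)

print(decode_nccl_version(21003))  # → (2, 10, 3), the runtime shown above
```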
libhttpasyncclient-java/focal 4.1.4-1 all
  HTTP/1.1 compliant asynchronous HTTP agent implementation

libnccl-dev/unknown 2.11.4-1+cuda11.6 amd64 [upgradable from: 2.9.9-1+cuda11.3]
  NVIDIA Collective Communication Library (NCCL) Development Files

libnccl2/unknown 2.11.4-1+cuda11.6 amd64 [upgradable from: 2.9.9-1+cuda11.3]
  NVIDIA Collective Communication Library (NCCL) Runtime

libpuppetlabs-http-client-clojure/focal 0.9.0-1 all
  Clojure wrapper around libhttpasyncclient-java

libvncclient1/focal-updates,focal-security 0.9.12+dfsg-9ubuntu0.3 amd64
  API to write one's own VNC server - client library

python-ncclient-doc/focal 0.6.0-2.1 all
  Documentation for python-ncclient (Python library for NETCONF clients)

python3-ncclient/focal 0.6.0-2.1 all
  Python library for NETCONF clients (Python 3)


The binaries ship with their own CUDA runtime, cuDNN, NCCL, etc. libs, so you won’t be able to change them directly.
You could build PyTorch from source and use your locally installed NCCL via e.g.:

NCCL_INCLUDE_DIR="/usr/include/" NCCL_LIB_DIR="/usr/lib/" USE_SYSTEM_NCCL=1 python setup.py install
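Before building, it can help to confirm which NCCL version the build would actually pick up from those directories, since the version macros live in nccl.h. A small sketch (the header excerpt is a hard-coded assumption here; on a real system you would read the nccl.h under your NCCL_INCLUDE_DIR):

```python
import re

# Assumed excerpt of an nccl.h; on a real system, read the file from
# the NCCL_INCLUDE_DIR you pass to the PyTorch build.
SAMPLE_HEADER = """
#define NCCL_MAJOR 2
#define NCCL_MINOR 9
#define NCCL_PATCH 9
"""

def header_nccl_version(text):
    # Extract the version macros PyTorch would compile against.
    vals = dict(re.findall(r"#define NCCL_(MAJOR|MINOR|PATCH)\s+(\d+)", text))
    return tuple(int(vals[k]) for k in ("MAJOR", "MINOR", "PATCH"))

print(header_nccl_version(SAMPLE_HEADER))  # → (2, 9, 9)
```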

I deleted the torch I had installed from pip and tried to build from source:

conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
conda install -c pytorch magma-cuda110 
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive --jobs 0
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

NCCL_INCLUDE_DIR="/usr/local/cuda-11.3/targets/x86_64-linux/include" NCCL_LIB_DIR="/usr/local/cuda-11.3/targets/x86_64-linux/lib" USE_SYSTEM_NCCL=1 python setup.py install

but it crashes with this error:

--   Private Dependencies : pthreadpool;cpuinfo;qnnpack;pytorch_qnnpack;nnpack;XNNPACK;fbgemm;fp16;/root/anaconda3/lib/;/root/anaconda3/lib/;gloo;tensorpipe;foxi_loader;rt;fmt::fmt-header-only;kineto;gcc_s;gcc;dl
-- Configuring incomplete, errors occurred!

Could you check in the build logs what exactly failed?

It turned out to be an NCCL P2P issue, not the NCCL version in PyTorch. Thanks!

Speaking of PyTorch compiled from source: we have noticed problems when PyTorch is compiled with one version of NCCL and a different version is then used in deployment, even though it’s compiled to use dynamic symbols.

The question is: Is PyTorch able to swap versions of NCCL, or do we have to recompile it for each NCCL upgrade?

We noticed that some things seem hardcoded, like this: pytorch/nccl.cpp at 428e02461f7b1079428012cd8c885bb892298c8c · pytorch/pytorch · GitHub

I don’t think manipulating dynamic links of any library is a supported use case and you would have to use it at your own risk (it can be a great debugging tool).

Hey, that’s not the case. The thing is that NCCL got upgraded on our supercomputers, but PyTorch wasn’t. Are you saying that we would have to recompile PyTorch all over again because of dynamic libraries? (The NCCL upgrade was from 2.11.4 to 2.12.7.)

Yes, I think if you are using dynamic linking and are upgrading NCCL on your clusters, the safe approach would be to rebuild PyTorch. If that’s not a desired use case, try to use static linking.

The suggestion that one uses static linking makes absolutely no sense.

The whole point of dynamic libraries is that, when the ABI is consistent (and NCCL’s is), one can interchange versions of the dynamically loaded library. That’s literally what dynamic libraries are for.

You are right that static linking would not allow you to change libraries, but I also didn’t see any explanation of your actual use case beyond pointing out problems when your deployment uses another NCCL version (by accident? on purpose? if so, why?).

So to your original question:

If you see improvements, contributions are more than welcome.

I am sorry if my answer missed crucial details. I should have been clearer and I will try to improve.

In the past, we found NCCL bugs which were showstoppers for some runs at larger scale (more than 512 nodes and 2048 GPUs, for example). In those cases, simply replacing NCCL (or, sometimes, UCX) was enough to fix things.

Their API has been stable enough that we can use the versions interchangeably (at least across minor upgrades, such as from 2.10 to 2.14 and so on).

Given that PyTorch calls NCCL dynamically, there is in general little problem with that (better said: none so far). The problem is that those lines assume the version used at compile time and give a wrong answer when probing for the NCCL version at runtime.

In pytorch/torch/csrc/cuda/nccl.cpp, at line 334, instead of relying on NCCL_MINOR and NCCL_PATCH, one could use the version detection that already exists in the code.
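As a sketch of that idea: probe the loaded library at runtime instead of trusting the compile-time macros. This is not the actual nccl.cpp change, just a Python illustration via ctypes; the library name is an assumption and the call only works on a machine where libnccl is actually installed:

```python
import ctypes

def decode_nccl_version(code):
    # NCCL >= 2.9: major*10000 + minor*100 + patch; older: major*1000 + ...
    if code < 2900:
        return (code // 1000, (code % 1000) // 100, code % 100)
    return (code // 10000, (code % 10000) // 100, code % 100)

def runtime_nccl_version(libname="libnccl.so.2"):
    """Ask the NCCL library loaded at runtime for its version, via
    ncclGetVersion(int*), instead of the macros baked in at compile time."""
    lib = ctypes.CDLL(libname)  # raises OSError if NCCL isn't present
    code = ctypes.c_int(0)
    if lib.ncclGetVersion(ctypes.byref(code)) != 0:
        raise RuntimeError("ncclGetVersion failed")
    return decode_nccl_version(code.value)

# Example (only on a machine with NCCL installed):
# runtime_nccl_version()  # e.g. (2, 12, 7) after an in-place NCCL upgrade
```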