Hi, I’m using CUDA 11.3, and multi-GPU runs freeze, so I thought it might be solved if I change torch.cuda.nccl.version…
Also, is there any way to find NCCL 2.10.3 in my environment? apt search nccl didn’t show the 2.10.3 version that torch.cuda.nccl.version reports. I wonder whether, if I remove 2.10.3, torch would fall back to 2.9.9 as the default.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.nccl.version()
(2, 10, 3)
libhttpasyncclient-java/focal 4.1.4-1 all
HTTP/1.1 compliant asynchronous HTTP agent implementation
libnccl-dev/unknown 2.11.4-1+cuda11.6 amd64 [upgradable from: 2.9.9-1+cuda11.3]
NVIDIA Collective Communication Library (NCCL) Development Files
libnccl2/unknown 2.11.4-1+cuda11.6 amd64 [upgradable from: 2.9.9-1+cuda11.3]
NVIDIA Collective Communication Library (NCCL) Runtime
libpuppetlabs-http-client-clojure/focal 0.9.0-1 all
Clojure wrapper around libhttpasyncclient-java
libvncclient1/focal-updates,focal-security 0.9.12+dfsg-9ubuntu0.3 amd64
API to write one's own VNC server - client library
python-ncclient-doc/focal 0.6.0-2.1 all
Documentation for python-ncclient (Python library for NETCONF clients)
python3-ncclient/focal 0.6.0-2.1 all
Python library for NETCONF clients (Python 3)
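One likely explanation for the apt results above (an assumption about the binary install, not something apt can confirm): pip/conda builds of PyTorch ship with, or statically link, their own copy of NCCL, so torch.cuda.nccl.version() can report 2.10.3 even though no apt package provides that version. A quick sketch to look for a bundled library inside the torch package tree:

```python
import glob
import importlib.util
import os

# Assumption: a pip/conda-installed torch bundles (or statically links)
# its own NCCL, which is why apt cannot see version 2.10.3.
spec = importlib.util.find_spec("torch")
if spec is None:
    print("torch is not installed in this environment")
else:
    torch_dir = os.path.dirname(spec.origin)
    # Look for any NCCL shared objects shipped inside the torch package;
    # an empty list suggests NCCL was statically linked instead.
    hits = glob.glob(os.path.join(torch_dir, "**", "*nccl*"), recursive=True)
    print(hits)
```

If this prints an empty list while torch is installed, the NCCL symbols were most likely linked statically into libtorch, which also means removing the apt packages would not change what torch reports.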
Speaking of PyTorch compiled from source, we have noticed problems when PyTorch is compiled against one version of NCCL and then, later in deployment, a different version is used, even when it is compiled to use dynamic symbols.
The question is: Is PyTorch able to swap versions of NCCL or do we have to recompile it for each NCCL upgrade?
Hey, that’s not the case. The thing is that NCCL got upgraded on our supercomputers, but PyTorch wasn’t. Are you saying that we would have to recompile PyTorch all over again because of dynamic libraries? (The NCCL upgrade was from 2.11.4 to 2.12.7.)
The suggestion to use static linking makes absolutely no sense here.
The whole point of dynamic libraries is that, when the ABI is consistent (and NCCL’s is), one can interchange versions of the dynamically loaded library. That’s literally what dynamic libraries are for.
You are right that static linking would not allow you to change libraries, but I also didn’t see any explanation of your actual use case beyond pointing out that problems occur when your deployment uses another NCCL version (by accident? on purpose? if so, why?).
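The runtime-interchange behavior described above can be illustrated without PyTorch at all: the version you get is decided by whichever libnccl.so.2 the dynamic loader resolves at load time, not by the headers present at compile time. A minimal sketch using ctypes (ncclGetVersion(int*) is the real NCCL API; the library may simply be absent on a given machine):

```python
import ctypes

# Resolve NCCL at run time, the way a dynamically linked binary would.
# Whichever libnccl.so.2 the loader finds determines the reported version.
try:
    nccl = ctypes.CDLL("libnccl.so.2")
except OSError:
    nccl = None  # no NCCL installed on this machine

if nccl is not None:
    version = ctypes.c_int()
    nccl.ncclGetVersion(ctypes.byref(version))  # NCCL API: ncclGetVersion(int*)
    print("runtime NCCL version code:", version.value)
```

Swapping the library on disk (or via LD_LIBRARY_PATH) changes what this prints, with no recompilation involved.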
So to your original question:
If you see improvements, contributions are more than welcome.
I am sorry if my answer missed crucial details. I should have been clearer and I will try to improve.
In the past, we found NCCL bugs which were showstoppers for some runs at larger scales (more than 512 nodes and 2048 GPUs, for example). In those cases, a simple replacement of NCCL (or, sometimes, UCX) was enough to fix things.
Their API has been stable enough that we can use them interchangeably (at least across minor upgrades, such as from 2.10 to 2.14 and so on).
Given that PyTorch calls NCCL dynamically, this is in general not a problem - none so far, in fact. The problem is that those lines assume the version used at compile time and therefore give a wrong answer when probing for the NCCL version at runtime.
In pytorch/torch/csrc/cuda/nccl.cpp, at line 334, instead of relying on the NCCL_MINOR and NCCL_PATCH macros, one could use the runtime version detection which already exists in the code.
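To make the compile-time/runtime mismatch concrete: ncclGetVersion() returns the version packed into a single integer, following NCCL's NCCL_VERSION macro (major*1000 + minor*100 + patch up to 2.8, major*10000 + minor*100 + patch from 2.9 onward). A small sketch decoding that integer back into the tuple torch reports:

```python
def decode_nccl_version(code: int) -> tuple:
    """Decode the integer from ncclGetVersion() into (major, minor, patch).

    NCCL >= 2.9 encodes major*10000 + minor*100 + patch;
    earlier releases used major*1000 + minor*100 + patch.
    """
    if code >= 20900:  # first code in the new scheme is 2.9.0 -> 20900
        return (code // 10000, (code // 100) % 100, code % 100)
    return (code // 1000, (code // 100) % 10, code % 100)

print(decode_nccl_version(21003))  # NCCL 2.10.3 -> (2, 10, 3)
print(decode_nccl_version(21207))  # NCCL 2.12.7 -> (2, 12, 7)
```

Probing the loaded library and decoding its answer this way would report whatever NCCL is actually resolved at runtime, rather than the version baked in at compile time.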