How can I change the NCCL version in PyTorch?

Hi, I'm using CUDA 11.3 and multi-GPU runs freeze, so I thought it might be solved if I changed torch.cuda.nccl.version…

Also, is there any way to find NCCL 2.10.3 in my environment? apt search nccl doesn't show the 2.10.3 version that torch.cuda.nccl.version() reports. I wonder if, after removing 2.10.3, torch would fall back to 2.9.9 as the default version.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.10.2+cu113'
>>> torch.cuda.nccl.version()
(2, 10, 3)
libhttpasyncclient-java/focal 4.1.4-1 all
  HTTP/1.1 compliant asynchronous HTTP agent implementation

libnccl-dev/unknown 2.11.4-1+cuda11.6 amd64 [upgradable from: 2.9.9-1+cuda11.3]
  NVIDIA Collective Communication Library (NCCL) Development Files

libnccl2/unknown 2.11.4-1+cuda11.6 amd64 [upgradable from: 2.9.9-1+cuda11.3]
  NVIDIA Collective Communication Library (NCCL) Runtime

libpuppetlabs-http-client-clojure/focal 0.9.0-1 all
  Clojure wrapper around libhttpasyncclient-java

libvncclient1/focal-updates,focal-security 0.9.12+dfsg-9ubuntu0.3 amd64
  API to write one's own VNC server - client library

python-ncclient-doc/focal 0.6.0-2.1 all
  Documentation for python-ncclient (Python library for NETCONF clients)

python3-ncclient/focal 0.6.0-2.1 all
  Python library for NETCONF clients (Python 3)

thanks

The binaries ship with their own CUDA runtime, cuDNN, NCCL, etc. libs, so you won’t be able to change them directly.
You could build PyTorch from source and use your locally installed NCCL via e.g.:

NCCL_INCLUDE_DIR="/usr/include/" NCCL_LIB_DIR="/usr/lib/" USE_SYSTEM_NCCL=1 python setup.py install
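
If the rebuild succeeds, one way to confirm that the new install actually picked up your system NCCL is a quick check like the following (a minimal sketch; the exact output depends on your build):

# Sanity check after rebuilding PyTorch against the system NCCL.
import torch

print(torch.__version__)                      # local build tag
print(torch.cuda.nccl.version())              # NCCL version PyTorch was built with
print(torch.distributed.is_nccl_available())  # True if the NCCL backend is present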

I deleted the torch I had installed from pip and tried to build from source:

conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
conda install -c pytorch magma-cuda110 
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive --jobs 0
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

NCCL_INCLUDE_DIR="/usr/local/cuda-11.3/targets/x86_64-linux/include" NCCL_LIB_DIR="/usr/local/cuda-11.3/targets/x86_64-linux/lib" USE_SYSTEM_NCCL=1 python setup.py install

but it fails with an error:

--   Private Dependencies : pthreadpool;cpuinfo;qnnpack;pytorch_qnnpack;nnpack;XNNPACK;fbgemm;fp16;/root/anaconda3/lib/libmpicxx.so;/root/anaconda3/lib/libmpi.so;gloo;tensorpipe;foxi_loader;rt;fmt::fmt-header-only;kineto;gcc_s;gcc;dl
--   USE_COREML_DELEGATE     : OFF
-- Configuring incomplete, errors occurred!

Could you check in the build logs what exactly failed?

It turned out to be an NCCL P2P issue, not the NCCL version in PyTorch. Thanks!

Speaking of PyTorch compiled from source, we have noticed problems when PyTorch is compiled with one version of NCCL and a different version is then used in deployment, even when it's compiled to use dynamic symbols.

The question is: Is PyTorch able to swap versions of NCCL or do we have to recompile it for each NCCL upgrade?

We noticed that some things seem to be hardcoded, like this: pytorch/nccl.cpp at 428e02461f7b1079428012cd8c885bb892298c8c · pytorch/pytorch · GitHub

I don’t think manipulating dynamic links of any library is a supported use case and you would have to use it at your own risk (it can be a great debugging tool).

Hey, that’s not the case. The thing is that NCCL got upgraded on our supercomputers, but PyTorch wasn’t. Are you saying that we would have to recompile PyTorch all again because of dynamic libraries? (The NCCL upgrade was from 2.11.4 to 2.12.7)

Yes, I think if you are using dynamic linking and are upgrading NCCL on your clusters, the safe approach would be to rebuild PyTorch. If that’s not a desired use case, try to use static linking.
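
As a side note on the dynamic-linking case: if NCCL does get swapped on the cluster underneath a prebuilt PyTorch, you can at least check which libnccl the process really loaded by inspecting its memory map (a Linux-only sketch; if nothing shows up, NCCL was probably linked statically):

# List the libnccl shared objects mapped into this process (Linux only).
import torch  # importing torch pulls in its dynamically linked dependencies

with open("/proc/self/maps") as f:
    nccl_paths = {line.split()[-1] for line in f if "libnccl" in line}

print(torch.cuda.nccl.version())  # version PyTorch reports
print(nccl_paths or "no libnccl.so mapped (possibly statically linked)")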

The suggestion to use static linking makes absolutely no sense.

The whole point of dynamic libraries is that, when the ABI is consistent (and NCCL's is), one can interchange versions of the dynamically loaded library. That's literally what dynamic libraries are for.

You are right that static linking would not allow you to change libraries, but I also didn't see any explanation of your actual use case besides pointing out problems when your deployment uses another NCCL version (by accident? on purpose? if so, why?).

So, to your original question: if you see improvements, contributions are more than welcome.

I am sorry if my answer missed crucial details. I should have been clearer and I will try to improve.

In the past, we found NCCL bugs which were showstoppers for some runs at larger scales (more than 512 nodes and 2048 GPUs, for example). In those cases, a simple replacement of NCCL (or, sometimes, UCX) was enough to fix things.

Their APIs have been stable enough that we can swap versions interchangeably (at least across minor upgrades, such as from 2.10 to 2.14 and so on).

Given that PyTorch calls NCCL dynamically, there is in general little problem with that, or better said: none so far. The problem is that those lines assume the version used at compile time and give a wrong answer when probing for the NCCL version.

In pytorch/torch/csrc/cuda/nccl.cpp, at line 334, instead of relying on NCCL_MINOR and NCCL_PATCH, one could use the version detection which already exists in the code.
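
To illustrate the difference: ncclGetVersion() asks the loaded NCCL library for its version at runtime, whereas torch.cuda.nccl.version() can reflect the headers PyTorch was compiled against. A rough sketch of the comparison (assumes a libnccl.so.2 is on the loader path; for NCCL >= 2.9 the integer encoding is major*10000 + minor*100 + patch):

# Compare the NCCL version PyTorch reports with the libnccl.so.2 loadable at runtime.
import ctypes
import torch

print("PyTorch reports:", torch.cuda.nccl.version())

nccl = ctypes.CDLL("libnccl.so.2")                  # assumes it is on the loader path
ver = ctypes.c_int()
assert nccl.ncclGetVersion(ctypes.byref(ver)) == 0  # 0 == ncclSuccess
code = ver.value                                    # e.g. 21003 for 2.10.3
print("Runtime libnccl reports:", (code // 10000, code // 100 % 100, code % 100))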


So, as of April 2023, is there any other way to change the NCCL version in PyTorch besides compiling PyTorch myself?

You might be able to hack around in the binary and replace NCCL libraries, which is unsupported since the pip wheels and conda binaries ship with the tagged CUDA, cuDNN, and NCCL version by design.
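
For reference, if you want to see which (if any) NCCL libraries ship inside your install, a rough search like the one below might help (just a sketch; the location differs between releases, e.g. newer pip wheels get NCCL from the separate nvidia-nccl-* packages, and some builds link it statically so nothing will be found):

# Search the site-packages directory that contains torch for any libnccl files.
import glob
import os
import torch

site_dir = os.path.dirname(os.path.dirname(torch.__file__))
for path in sorted(glob.glob(os.path.join(site_dir, "**", "libnccl*"), recursive=True)):
    print(path)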

I was told in a GitHub issue response from NCCL that I should use NCCL 2.17.1 to solve some of my issues.

In that case, what should I change my PyTorch version to? The current version works with NCCL 2.12.10.

I am using a YAML file for dependencies like this:

train-env.yaml

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: nvidia_pytorch
build:
  path: ../../../data-science/environment/
tags:
  os: ubuntu
  os_version: 20.04
  hpcx: 2.10
  mpi: openmpi
  mpi_version: 4.1.2rc4
  ucx: 1.12.0
  cuda: 11.6.2
  cudnn: 8.4.0.27
  nccl: 2.12.10
  # nccl: 2.17.1
  rdma_core: 36.0
  nsight_compute: 2022.1.1.2
  nsight_systems: "2022.2.1.31-5fe97ab"
  nccl_test: 2.11.0
  # azureml-defaults: 1.41.0
  # mlflow: 1.25.1
  azureml-defaults: 1.50.0
  mlflow: 2.3.2
  transformers: 4.18.0

here is my requirements.txt

 # for local testing (cpu)
torchvision==0.12.0
torch==1.11.0
transformers==4.18.0

# for metrics reporting/plotting
# mlflow==1.25.1
# azureml-mlflow==1.41.0
mlflow==2.3.2
azureml-mlflow==1.50.0
matplotlib==3.5.2
tqdm==4.64.0
psutil==5.9.0

# for unit testing
pytest==7.1.2

and here is my Dockerfile:

# check release notes https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
FROM nvcr.io/nvidia/pytorch:22.04-py3

##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0

# NOTE: adding gencodes to support K80, M60, V100, A100
RUN mkdir /tmp/nccltests && \
    cd /tmp/nccltests && \
    git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make \
    MPI=1 MPI_HOME=/opt/hpcx/ompi \
    NVCC_GENCODE="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    CUDA_HOME=/usr/local/cuda && \
    cp ./build/* /usr/local/bin && \
    rm -rf /tmp/nccltests

# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt


# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

# adjusts the level of info from NCCL tests
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"

# Relaxed Ordering can greatly help the performance of Infiniband networks in virtualized environments.
ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"
# ENV NCCL_SOCKET_IFNAME='lo'
ENV NCCL_IB_DISABLE="1"

I also came across an NCCL problem, where only 2.18.3 works for me while PyTorch 2.2.0 requires 2.19.3.

My solution is to write an NCCL binding myself and keep my NCCL version separate from PyTorch's NCCL version to avoid any conflict.
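
In case it helps anyone, the core of that approach can be as small as loading the desired libnccl explicitly via ctypes, completely bypassing the copy PyTorch was built against (a sketch; /opt/nccl-2.18.3/lib/libnccl.so.2 is a hypothetical path to the separately installed NCCL):

# Tiny standalone NCCL binding: load a specific libnccl.so.2 and call its C API
# directly, independent of whatever NCCL version PyTorch itself uses.
import ctypes

NCCL_UNIQUE_ID_BYTES = 128  # fixed size of ncclUniqueId in nccl.h

class NcclUniqueId(ctypes.Structure):
    _fields_ = [("internal", ctypes.c_char * NCCL_UNIQUE_ID_BYTES)]

lib = ctypes.CDLL("/opt/nccl-2.18.3/lib/libnccl.so.2")  # hypothetical install path

ver = ctypes.c_int()
assert lib.ncclGetVersion(ctypes.byref(ver)) == 0       # 0 == ncclSuccess
print("loaded NCCL version code:", ver.value)

uid = NcclUniqueId()
rc = lib.ncclGetUniqueId(ctypes.byref(uid))             # id to share with other ranks
print("ncclGetUniqueId:", "ok" if rc == 0 else f"error {rc}")

The rest of the binding (ncclCommInitRank, ncclAllReduce, and so on) follows the same ctypes pattern, but staying outside torch.distributed also means you have to manage communicators and CUDA streams yourself.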