pytorch-lightning say “NCCL backends are built and included in PyTorch distributed (NCCL only when building with CUDA)”
When I upgrade my driver and docker from
nvidia-driver-470 on the host
FROM pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime
to
nvidia-driver-510 on the host
FROM pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
or later
I get the error
misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
I just upgraded my 2x (Titan RTX with nvlink) to 2x (A600 with no nvlink), pl version pytorch-lightning==1.0.7 and same error for version 1.5 .
My call to train is as follows:
trainer = pl.Trainer( gpus=[0,1],
distributed_backend='ddp', # strategy='ddp' for pl=1.5
...
I can’t stay on pytorch:1.6.0-cuda10.1-cudnn7 since A6000 needs sm_86 support.
Is there something else I need to do when going from the docker image of pytorch:1.6.0-cuda10.1-cudnn7 to pytorch:1.7.1-cuda11.0-cudnn8-runtime or later to use ddp?
Demo of Issue
Versions
docker image pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
bug_1 | Collecting environment information…
bug_1 | PyTorch version: 1.10.0
bug_1 | Is debug build: False
bug_1 | CUDA used to build PyTorch: 11.3
bug_1 | ROCM used to build PyTorch: N/A
bug_1 | OS: Ubuntu 18.04.6 LTS (x86_64)
bug_1 | GCC version: Could not collect
bug_1 | Clang version: Could not collect
bug_1 | CMake version: Could not collect
bug_1 | Libc version: glibc-2.17
bug_1 | Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] (64-bit runtime)
bug_1 | Python platform: Linux-5.4.0-100-generic-x86_64-with-debian-buster-sid
bug_1 | Is CUDA available: True
bug_1 | CUDA runtime version: Could not collect
bug_1 | GPU models and configuration:
bug_1 | GPU 0: NVIDIA RTX A6000
bug_1 | GPU 1: NVIDIA RTX A6000
bug_1 |
bug_1 | Nvidia driver version: 510.47.03
bug_1 | cuDNN version: Could not collect
bug_1 | HIP runtime version: N/A
bug_1 | MIOpen runtime version: N/A
bug_1 | Is XNNPACK available: True
bug_1 | Versions of relevant libraries:
bug_1 | [pip3] numpy==1.21.2
bug_1 | [pip3] pytorch-lightning==1.5.10
bug_1 | [pip3] torch==1.10.0
bug_1 | [pip3] torchelastic==0.2.0
bug_1 | [pip3] torchmetrics==0.7.2
bug_1 | [pip3] torchtext==0.11.0
bug_1 | [pip3] torchvision==0.11.0
bug_1 | [conda] blas 1.0 mkl
bug_1 | [conda] cudatoolkit 11.3.1 ha36c431_9 nvidia
bug_1 | [conda] ffmpeg 4.3 hf484d3e_0 pytorch
bug_1 | [conda] mkl 2021.3.0 h06a4308_520
bug_1 | [conda] mkl-service 2.4.0 py37h7f8727e_0
bug_1 | [conda] mkl_fft 1.3.1 py37hd3c417c_0
bug_1 | [conda] mkl_random 1.2.2 py37h51133e4_0
bug_1 | [conda] numpy 1.21.2 py37h20f2e39_0
bug_1 | [conda] numpy-base 1.21.2 py37h79a1101_0
bug_1 | [conda] pytorch 1.10.0 py3.7_cuda11.3_cudnn8.2.0_0 pytorch
bug_1 | [conda] pytorch-lightning 1.5.10 pypi_0 pypi
bug_1 | [conda] pytorch-mutex 1.0 cuda pytorch
bug_1 | [conda] torchelastic 0.2.0 pypi_0 pypi
bug_1 | [conda] torchmetrics 0.7.2 pypi_0 pypi
bug_1 | [conda] torchtext 0.11.0 py37 pytorch
bug_1 | [conda] torchvision 0.11.0 py37_cu113 pytorch
bug_1 | ****************************
bug_1 | initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
bug_1 | /opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:120: UserWarning: You passed in a val_dataloader
but have no validation_step
. Skipping val loop.
bug_1 | rank_zero_warn(“You passed in a val_dataloader
but have no validation_step
. Skipping val loop.”)
bug_1 | initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
bug_1 | ----------------------------------------------------------------------------------------------------
bug_1 | distributed_backend=nccl
bug_1 | All distributed processes registered. Starting with 2 processes
bug_1 | ----------------------------------------------------------------------------------------------------
bug_1 |
bug_1 |
bug_1 | be7823cc7810:7:7 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
bug_1 | NCCL version 2.10.3+cuda11.3
bug_1 |
bug_1 | be7823cc7810:72:72 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]