A6000 NCCL WARN Failed to open libibverbs.so

The pytorch-lightning docs say “NCCL backends are built and included in PyTorch distributed (NCCL only when building with CUDA)”.

When I upgraded my driver and Docker image from

nvidia-driver-470 on the host
FROM pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime

to

nvidia-driver-510 on the host
FROM pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
or later

I get the error
misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]

I just upgraded from 2x Titan RTX (with NVLink) to 2x A6000 (no NVLink). I'm on pytorch-lightning==1.0.7 and see the same error with version 1.5.

My call to train is as follows:

trainer = pl.Trainer(
    gpus=[0, 1],
    distributed_backend='ddp',  # strategy='ddp' for pl=1.5
    ...
)

I can’t stay on pytorch:1.6.0-cuda10.1-cudnn7 since the A6000 needs sm_86 support.

Is there something else I need to do when moving from the pytorch:1.6.0-cuda10.1-cudnn7 image to pytorch:1.7.1-cuda11.0-cudnn8-runtime or later in order to use DDP?
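
For reference, a minimal sketch of the kind of two-GPU DDP run I mean is below (the boring model and random dataset are stand-ins, not my actual demo code):

    # Minimal two-GPU DDP sketch for pytorch-lightning 1.5.x; on 1.0.x replace
    # strategy='ddp' with distributed_backend='ddp'. Stand-in model/data only.
    import torch
    from torch.utils.data import DataLoader, Dataset
    import pytorch_lightning as pl


    class RandomDataset(Dataset):
        def __len__(self):
            return 64

        def __getitem__(self, idx):
            return torch.randn(32)


    class BoringModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            # Any scalar loss is enough to exercise the NCCL gradient all-reduce.
            return self.layer(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        trainer = pl.Trainer(
            gpus=[0, 1],
            strategy="ddp",  # distributed_backend='ddp' on pl==1.0.7
            max_epochs=1,
        )
        trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))

With gpus=[0] a script like this finishes; with gpus=[0, 1] and DDP it just sits there after the NCCL warning.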

Demo of Issue

Versions

docker image pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime

bug_1 | Collecting environment information…
bug_1 | PyTorch version: 1.10.0
bug_1 | Is debug build: False
bug_1 | CUDA used to build PyTorch: 11.3
bug_1 | ROCM used to build PyTorch: N/A
bug_1 | OS: Ubuntu 18.04.6 LTS (x86_64)
bug_1 | GCC version: Could not collect
bug_1 | Clang version: Could not collect
bug_1 | CMake version: Could not collect
bug_1 | Libc version: glibc-2.17
bug_1 | Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] (64-bit runtime)
bug_1 | Python platform: Linux-5.4.0-100-generic-x86_64-with-debian-buster-sid
bug_1 | Is CUDA available: True
bug_1 | CUDA runtime version: Could not collect
bug_1 | GPU models and configuration:
bug_1 | GPU 0: NVIDIA RTX A6000
bug_1 | GPU 1: NVIDIA RTX A6000
bug_1 |
bug_1 | Nvidia driver version: 510.47.03
bug_1 | cuDNN version: Could not collect
bug_1 | HIP runtime version: N/A
bug_1 | MIOpen runtime version: N/A
bug_1 | Is XNNPACK available: True
bug_1 | Versions of relevant libraries:
bug_1 | [pip3] numpy==1.21.2
bug_1 | [pip3] pytorch-lightning==1.5.10
bug_1 | [pip3] torch==1.10.0
bug_1 | [pip3] torchelastic==0.2.0
bug_1 | [pip3] torchmetrics==0.7.2
bug_1 | [pip3] torchtext==0.11.0
bug_1 | [pip3] torchvision==0.11.0
bug_1 | [conda] blas 1.0 mkl
bug_1 | [conda] cudatoolkit 11.3.1 ha36c431_9 nvidia
bug_1 | [conda] ffmpeg 4.3 hf484d3e_0 pytorch
bug_1 | [conda] mkl 2021.3.0 h06a4308_520
bug_1 | [conda] mkl-service 2.4.0 py37h7f8727e_0
bug_1 | [conda] mkl_fft 1.3.1 py37hd3c417c_0
bug_1 | [conda] mkl_random 1.2.2 py37h51133e4_0
bug_1 | [conda] numpy 1.21.2 py37h20f2e39_0
bug_1 | [conda] numpy-base 1.21.2 py37h79a1101_0
bug_1 | [conda] pytorch 1.10.0 py3.7_cuda11.3_cudnn8.2.0_0 pytorch
bug_1 | [conda] pytorch-lightning 1.5.10 pypi_0 pypi
bug_1 | [conda] pytorch-mutex 1.0 cuda pytorch
bug_1 | [conda] torchelastic 0.2.0 pypi_0 pypi
bug_1 | [conda] torchmetrics 0.7.2 pypi_0 pypi
bug_1 | [conda] torchtext 0.11.0 py37 pytorch
bug_1 | [conda] torchvision 0.11.0 py37_cu113 pytorch
bug_1 | ****************************
bug_1 | initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
bug_1 | /opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:120: UserWarning: You passed in a val_dataloader but have no validation_step. Skipping val loop.
bug_1 | rank_zero_warn(“You passed in a val_dataloader but have no validation_step. Skipping val loop.”)
bug_1 | initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
bug_1 | ----------------------------------------------------------------------------------------------------
bug_1 | distributed_backend=nccl
bug_1 | All distributed processes registered. Starting with 2 processes
bug_1 | ----------------------------------------------------------------------------------------------------
bug_1 |
bug_1 |
bug_1 | be7823cc7810:7:7 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
bug_1 | NCCL version 2.10.3+cuda11.3
bug_1 |
bug_1 | be7823cc7810:72:72 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]

This is not necessarily an error; it's just a warning that NCCL couldn’t find libibverbs.so, and as a result things like InfiniBand connectivity might be affected. Are you using something like InfiniBand or GPU RDMA in your setup?
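
If you are not using InfiniBand, the warning itself is harmless. As a rough sketch (assuming the standard NCCL environment variables, nothing Lightning-specific), you can tell NCCL to skip InfiniBand entirely and to log what it is doing, which silences the warning and also shows where a hang happens:

    # Set before the process group is created (top of the training script, or
    # exported in the shell before launching).
    # NCCL_IB_DISABLE=1: skip the InfiniBand transport, so libibverbs isn't needed.
    # NCCL_DEBUG=INFO: print which transports/rings NCCL selects (useful for hangs).
    import os

    os.environ.setdefault("NCCL_IB_DISABLE", "1")
    os.environ.setdefault("NCCL_DEBUG", "INFO")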

Just two A6000s in PCIe slots; waiting for NVLink.
I created a small demo of this issue and it just sits there.

My real code just sits there too when I use 2 GPUs and DDP. With the Titan RTX cards this code was stable for a year. If I use just one GPU, it works.

This might be a PyTorch Lightning setup issue. Can you create an issue in the PyTorchLightning/pytorch-lightning GitHub repo to confirm whether it is indeed related to PyTorch Lightning?

Btw does your script exit or just get stuck after printing those warnings? If it is the latter, I think it might be some sort of setup issue related to PyTorch Lightning.

The demo code runs to completion with 1 GPU and no DDP in about 2 minutes on an A6000.

My real code is too big, so I created a very short demonstration of this issue (the small demo code mentioned above).

Since my big code ran correctly with DDP using the Titan RTX cards on PyTorch 1.6, and my code fails in exactly the same way as the demo with PyTorch 1.7, it looks like an issue with either NVIDIA or PyTorch. But I did raise an item with pytorch-lightning.

Does NCCL work on all PCIe setups (i.e., all motherboards)?
My motherboard is a Gigabyte Designare.

The issue is also cross-posted here and so far I’m unable to reproduce the hang.
Let’s wait for the nccl-tests runs on your setup to see whether NCCL is working at all or what is causing the hang.
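
If building the nccl-tests binaries is inconvenient, a rough stand-in (a plain torch.distributed sketch, not the official nccl-tests) is a bare NCCL all_reduce across the two GPUs; if this also hangs, the problem is below Lightning:

    # Hypothetical file nccl_check.py; launch with:
    #   python -m torch.distributed.run --nproc_per_node=2 nccl_check.py
    import os
    import torch
    import torch.distributed as dist


    def main():
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")
        x = torch.ones(1024, device=f"cuda:{local_rank}")
        dist.all_reduce(x)  # this is where a broken NCCL transport will hang
        torch.cuda.synchronize()
        print(f"rank {local_rank}: all_reduce ok, first element = {x[0].item()}")
        dist.destroy_process_group()


    if __name__ == "__main__":
        main()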


ptrblck, thank you for suggesting disabling IOMMU; IOMMU=on did indeed cause my system to lock up. You are so kind to offer your help to everyone. Thank you.


Good to hear it’s working now! 🙂
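
For anyone hitting the same symptom: the hang in this thread went away after disabling IOMMU, which can interfere with GPU peer-to-peer traffic over PCIe. As a hedged diagnostic (a test, not a proper fix), you can also force NCCL to route traffic through host memory; if the run then completes, the peer-to-peer path is the culprit:

    # Diagnostic sketch: if the DDP run completes with P2P disabled but hangs with
    # it enabled, the problem is in the GPU peer-to-peer path (e.g. IOMMU/ACS),
    # not in Lightning or the model. Set before launching the run.
    import os

    os.environ["NCCL_P2P_DISABLE"] = "1"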
