PyTorch-V100 - NCCL2 Installation


I have managed to deploy PyTorch 0.2 on a V100 using the AWS AMI Deep Learning Image. I installed Miniconda3 (version <= 4.3.21) and the required dependencies.

However, I want to create a lean stack with our industry-specific dependencies, and I am having trouble with the NCCL2 installation. I downloaded the .deb package from NVIDIA Developer, but when I extract it and attempt to install the two .deb files, I receive an error on libnccl2_2.0.5-3+cuda9.0_amd64.deb:

Preparing to unpack libnccl2_2.0.5-3+cuda9.0_amd64.deb ...
Unpacking libnccl2 (2.0.5-3+cuda9.0) ...
dpkg: dependency problems prevent configuration of libnccl2:
 libnccl2 depends on cuda-cudart-9-0; however:
  Package cuda-cudart-9-0 is not installed.

dpkg: error processing package libnccl2 (--install):
 dependency problems - leaving unconfigured
Processing triggers for libc-bin (2.23-0ubuntu9) ...
Errors were encountered while processing:

This is kinda crazy since I know for a fact that CUDA 9 is installed, and I have no idea how to resolve it. I assume that once this is resolved, libnccl-dev_2.0.5-3+cuda9.0_amd64.deb will install fine. I also assume that I then copy the files into the directories where the PyTorch build looks for NCCL.

The NVIDIA install NCCL2 documentation isn’t very comprehensive. I am hoping someone here might know a hack or trick to get this working.

I have CUDA9, CUDNN7 working fine, and confirmed with --version. At this point, I just need to install NCCL2, then I can build PyTorch.

I am pretty confident once I can sort out the NCCL2 installation, I can get it to run on bare metal.

Any help will be appreciated.


If you installed CUDA without a .deb, you can't install libraries that depend on it from .debs, because dpkg has no record of the CUDA packages. Building NCCL yourself is straightforward for NCCL 1, but NCCL2 you cannot build. You could instead extract the package contents with dpkg -x (-e extracts only the control information).
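A rough sketch of the extraction route, assuming the two .deb files named in the question sit in the current directory. DEST is a hypothetical staging directory, not something PyTorch or NCCL requires. The script first asks dpkg whether it has cuda-cudart-9-0 on record, then unpacks the file trees without installing anything:

```shell
#!/bin/sh
# DEST is a hypothetical staging directory; pick any location you like.
DEST="${DEST:-$HOME/nccl2}"
mkdir -p "$DEST"

# dpkg only tracks packages installed from .debs, so a CUDA that came from
# a runfile or a prebuilt image will not show up here.
if dpkg -s cuda-cudart-9-0 >/dev/null 2>&1; then
    echo "dpkg knows about cuda-cudart-9-0; a plain 'dpkg -i' should work"
else
    echo "cuda-cudart-9-0 is not registered with dpkg; extracting instead"
fi

# -x unpacks the package's file tree into DEST without touching the dpkg
# database; -e would unpack only the control files.
for deb in libnccl2_2.0.5-3+cuda9.0_amd64.deb \
           libnccl-dev_2.0.5-3+cuda9.0_amd64.deb; do
    if [ -f "$deb" ]; then
        dpkg -x "$deb" "$DEST"
    else
        echo "skipping $deb (not found in $(pwd))" >&2
    fi
done

# Show where the header and shared libraries ended up under DEST.
find "$DEST" \( -name 'nccl.h' -o -name 'libnccl*' \) -print
```

The find output tells you which paths to copy from, or to point the PyTorch build at. Nothing here registers NCCL with dpkg, so removing it later just means deleting the files.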

Best regards

