PyTorch-V100 - NCCL2 Installation

TeslasGhost · November 7, 2017, 8:21am

Hi,

I have managed to deploy PyTorch 0.2 to V100 using the AWS AMI Deep Learning Image. I installed Miniconda3 version <= 4.3.21 and installed the required dependencies.

However, I want to create a lean stack with our industry-specific dependencies, and I am having trouble with the NCCL2 installation. I downloaded the .deb package from NVIDIA Developer, but when I extract it and attempt to install the two .deb files it contains:

libnccl2_2.0.5-3+cuda9.0_amd64.deb
libnccl-dev_2.0.5-3+cuda9.0_amd64.deb

I receive an error on libnccl2_2.0.5-3+cuda9.0_amd64.deb:

Preparing to unpack libnccl2_2.0.5-3+cuda9.0_amd64.deb ...
Unpacking libnccl2 (2.0.5-3+cuda9.0) ...
dpkg: dependency problems prevent configuration of libnccl2:
 libnccl2 depends on cuda-cudart-9-0; however:
  Package cuda-cudart-9-0 is not installed.

dpkg: error processing package libnccl2 (--install):
 dependency problems - leaving unconfigured
Processing triggers for libc-bin (2.23-0ubuntu9) ...
Errors were encountered while processing:
 libnccl2

This is kinda crazy since I know for a fact that CUDA 9 is installed. I have no idea how to resolve this. I assume that once I resolve it, libnccl-dev_2.0.5-3+cuda9.0_amd64.deb will install fine. I also assume that I then copy the installed files into the directories where the PyTorch build looks for NCCL.

The NVIDIA install NCCL2 documentation isn’t very comprehensive. I am hoping someone here might know a hack or trick to get this working.

I have CUDA9, CUDNN7 working fine, and confirmed with --version. At this point, I just need to install NCCL2, then I can build PyTorch.

I am pretty confident once I can sort out the NCCL2 installation, I can get it to run on bare metal.

Any help will be appreciated.

Thank you…

tom · November 8, 2017, 7:33am

If you installed CUDA without the .deb packages, you can't install dependent libraries from .debs, because dpkg has no record of cuda-cudart-9-0. ~~You can build nccl2 yourself, that should be very straightforward.~~ (nccl2 you cannot) You could extract the package with dpkg -x (-e extracts only the control information).
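
A minimal sketch of that extraction, assuming the two .deb files from the post above are in the current directory. NCCL_PREFIX and the helper function names are hypothetical, chosen only for illustration; where your PyTorch build expects to find NCCL is something you would still need to check for your version.

```shell
#!/bin/sh
# Sketch: unpack the NCCL2 .debs without dpkg's dependency check.
# NCCL_PREFIX is a hypothetical destination directory, not a path PyTorch requires.
NCCL_PREFIX="${NCCL_PREFIX:-$HOME/nccl2}"

# Succeeds only if dpkg's database lists the package as installed.
# A CUDA installed from the runfile is not recorded there, which is
# why dpkg reports cuda-cudart-9-0 as missing.
dpkg_knows() {
    dpkg -s "$1" 2>/dev/null | grep -q '^Status: install ok installed'
}

# Unpack each .deb's file tree under NCCL_PREFIX.
# dpkg -x only extracts files; it does not configure or check dependencies.
extract_debs() {
    mkdir -p "$NCCL_PREFIX" || return 1
    for deb in "$@"; do
        dpkg -x "$deb" "$NCCL_PREFIX" || return 1
    done
}

# Only act when both packages are actually present.
if [ -f libnccl2_2.0.5-3+cuda9.0_amd64.deb ] && [ -f libnccl-dev_2.0.5-3+cuda9.0_amd64.deb ]; then
    dpkg_knows cuda-cudart-9-0 || echo "cuda-cudart-9-0 is not registered with dpkg"
    extract_debs libnccl2_2.0.5-3+cuda9.0_amd64.deb libnccl-dev_2.0.5-3+cuda9.0_amd64.deb
    # Show where the libraries and header ended up inside the prefix.
    find "$NCCL_PREFIX" \( -name 'libnccl*' -o -name 'nccl.h' \)
fi
```

The find output tells you which directories hold the shared library and nccl.h, so you can copy them or point the build at them. You can also preview a package's layout beforehand with dpkg -c on the .deb.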

Best regards

Thomas