Hi,
I have managed to deploy PyTorch 0.2 on a V100 using the AWS Deep Learning AMI. I installed Miniconda3 (version <= 4.3.21) and the required dependencies.
However, I want to build a lean stack with our industry-specific dependencies, and I am having trouble installing NCCL2. I downloaded the .deb package from NVIDIA Developer, but when I extract it and try to install the two .deb files it contains:
libnccl2_2.0.5-3+cuda9.0_amd64.deb
libnccl-dev_2.0.5-3+cuda9.0_amd64.deb
I get an error on libnccl2_2.0.5-3+cuda9.0_amd64.deb:
Preparing to unpack libnccl2_2.0.5-3+cuda9.0_amd64.deb ...
Unpacking libnccl2 (2.0.5-3+cuda9.0) ...
dpkg: dependency problems prevent configuration of libnccl2:
libnccl2 depends on cuda-cudart-9-0; however:
Package cuda-cudart-9-0 is not installed.
dpkg: error processing package libnccl2 (--install):
dependency problems - leaving unconfigured
Processing triggers for libc-bin (2.23-0ubuntu9) ...
Errors were encountered while processing:
libnccl2
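To narrow this down, here is a minimal Python sketch that pulls the unmet package names out of output like the above and asks dpkg's own database about each one. The helper names (missing_packages, dpkg_status) are hypothetical and are not part of dpkg or NCCL. The working assumption, which I have not confirmed, is that CUDA on this machine was installed outside of dpkg (for example from a runfile), so the files exist on disk but dpkg has no record of cuda-cudart-9-0.

```python
import re
import shutil
import subprocess
from typing import List, Optional

# Matches dpkg lines of the form "Package <name> is not installed."
_NOT_INSTALLED = re.compile(r"Package (\S+) is not installed\.")


def missing_packages(dpkg_output: str) -> List[str]:
    """Return the package names dpkg reported as not installed (no duplicates)."""
    names: List[str] = []
    for name in _NOT_INSTALLED.findall(dpkg_output):
        if name not in names:
            names.append(name)
    return names


def dpkg_status(package: str) -> Optional[str]:
    """Return dpkg's status string for a package, or None if dpkg does not know it.

    Also returns None when dpkg-query is not available on this machine.
    """
    if shutil.which("dpkg-query") is None:
        return None
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Status}", package],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None


# The relevant part of the error output pasted above.
log = """dpkg: dependency problems prevent configuration of libnccl2:
libnccl2 depends on cuda-cudart-9-0; however:
Package cuda-cudart-9-0 is not installed.
"""

if __name__ == "__main__":
    for name in missing_packages(log):
        print(name, "->", dpkg_status(name) or "not known to dpkg")
```

If the package shows up as not known to dpkg even though the CUDA runtime files are present, that would be consistent with the assumption above: dpkg only checks its own package database, not what is actually on disk.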
This is confusing, since I know for a fact that CUDA 9 is installed, and I have no idea how to resolve it. I assume that once this is resolved, libnccl-dev_2.0.5-3+cuda9.0_amd64.deb will install fine. I also assume that I then need to copy the NCCL files into the directories where the PyTorch build looks for NCCL.
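For the second part, a small sketch like the following could at least show where the NCCL library and header ended up before deciding whether anything needs copying. The directory list is an assumption on my part (typical Debian/Ubuntu and CUDA locations), and find_nccl is a hypothetical helper, not something PyTorch or NCCL provides.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed candidate locations; adjust for the actual machine.
CANDIDATE_DIRS = [
    "/usr/lib/x86_64-linux-gnu",
    "/usr/include",
    "/usr/local/cuda/lib64",
    "/usr/local/cuda/include",
]


def find_nccl(roots: Iterable[str]) -> Dict[str, List[str]]:
    """Look directly inside each directory for libnccl.so* and nccl.h."""
    found: Dict[str, List[str]] = {"libraries": [], "headers": []}
    for root in roots:
        base = Path(root)
        if not base.is_dir():
            continue  # skip directories that do not exist on this machine
        found["libraries"] += sorted(str(p) for p in base.glob("libnccl.so*"))
        found["headers"] += sorted(str(p) for p in base.glob("nccl.h"))
    return found


if __name__ == "__main__":
    result = find_nccl(CANDIDATE_DIRS)
    print("libraries:", result["libraries"])
    print("headers:", result["headers"])
```

Empty lists would simply mean the files are not in any of the assumed directories, not necessarily that NCCL is missing.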
NVIDIA's NCCL2 installation documentation isn't very comprehensive, so I am hoping someone here knows a workaround to get this working.
I have CUDA 9 and cuDNN 7 working fine, confirmed with --version. At this point I just need to install NCCL2, and then I can build PyTorch.
I am fairly confident that once the NCCL2 installation is sorted out, I can get everything running on bare metal.
Any help will be appreciated.
Thank you.