PyTorch on V100 GPU

Has anyone succeeded in running PyTorch on a V100 GPU?
I've tried this on one of the new Amazon EC2 instances, but to no avail: calling .cuda() on a tensor
simply makes Python hang, seemingly indefinitely.

Does PyTorch support the V100?


What version of PyTorch are you on? What is your CUDA version? What does nvidia-smi show while Python is hanging on the .cuda() call?
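
For example, these cover most of it (assuming the conda environment with torch is active and the CUDA toolkit is on your PATH):

python -c "import torch; print(torch.__version__)"
nvcc --version
nvidia-smi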

I've stopped the machine for now, but I used the latest PyTorch version installed via conda for Python 3.6, as per the instructions on the pytorch.org website, with CUDA 9 and the NVIDIA driver from here:
https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run

The latest PyTorch binary (the one you've installed via conda) does not support Volta; you have to compile from source. Make sure you have the CUDA 9 versions of NCCL and cuDNN.
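
For reference, the rough sequence is something like the following (just a sketch of the build-from-source instructions on the PyTorch GitHub page, assuming an Anaconda environment):

conda install numpy pyyaml mkl setuptools cmake cffi
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"   # anaconda root directory
python setup.py install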


So I got it to work by launching from the latest Ubuntu deep learning AMI for CUDA 9 and then installing torch and torchvision from source. To install torchvision from source, though, I had to give write permissions to Anaconda's easy-install. I'm not sure why I had to do this, or whether granting those permissions is good practice, but after that it worked. I did notice that the first call to Tensor(1).cuda() took about 10 seconds and immediately ate about 700 MB of GPU memory, but after that everything went smoothly.
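
If anyone wants to reproduce that warm-up, a quick (unscientific) way is the one-liner below, while watching nvidia-smi in another terminal for the memory jump:

python -c "import torch, time; t = time.time(); x = torch.Tensor(1).cuda(); print(time.time() - t)"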


Hi, did you do "conda install -c soumith magma-cuda80" when you installed from source? How did you deal with the CUDA version issue?

Do you mind posting the steps you used to do that? For example, as @JasonRen mentioned, did you use the "magma-cuda80" option? That seems to contradict CUDA 9.

You don't need magma-cuda80 to use CUDA 9, and I think installing it would probably cause some issues. MAGMA provides some operations that you can probably get along without.

Where do you get NCCL?

Sorry, this is a basic question, but does PyTorch support CUDA 9? When I tried to build with CUDA 9 in nvidia-docker, it failed…

It does support CUDA 9, although when I built it about a month ago it initially failed because the NCCL that comes bundled with PyTorch did not work with CUDA 9. So I found NCCL 2 on GitHub, copied its files over the NCCL folder that ships with PyTorch, and was able to build successfully.


Amazon's CUDA 9 deep learning AMI has it; otherwise you can get it at http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/

Using nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 as your base image should work, but you still have to install nccl-dev from http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/
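
Something like this in a RUN step should do it (the repo .deb and package names here are from memory, so double-check them against the directory listing above):

apt-get update && apt-get install -y wget
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
apt-get update
apt-get install -y libnccl2 libnccl-dev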


@Amir_Rosenfeld, the steps are:

1. Get the AWS Deep Learning AMI (Ubuntu 16.04) with CUDA 9.
2. Clone pytorch/torchvision from master.
3. Run setup.py for pytorch following the instructions on the PyTorch website, but leave out magma.
4. Run setup.py for torchvision, but only after giving permissions to easy-install in the Anaconda folder AND editing its setup.py so that torch isn't a requirement (for some reason it didn't see the PyTorch installation from source and would try to re-install the latest conda version, which doesn't support CUDA 9). See the sketch below for a possible shortcut.
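
One possible shortcut for that last step, instead of editing setup.py (I haven't verified this on the exact same setup, so treat it as a sketch): install torchvision from the clone with pip and skip dependency resolution, so it can't try to pull in the conda torch package.

git clone https://github.com/pytorch/vision
cd vision
pip install --no-deps .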

For me, using conda or pip directly for torch or vision broke everything. Note that I still use conda for the steps in the GitHub instructions:

export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

# Install basic dependencies

conda install numpy pyyaml mkl setuptools cmake cffi


Thanks, I’ll try this out!

I was able to get PyTorch working on the V100 instances on AWS as well. I have my own Packer-based script to package/customize an AMI using Anaconda Python; the following script handles the GPU stuff: https://github.com/Pinafore/qb/blob/master/packer/bin/install-cuda.sh

That was written for CUDA 8, so I started with an instance using that image and then installed CUDA 9 plus compatible versions of cuDNN and NCCL. After that I was able to do a simple python setup.py install from the current master branch and start using PyTorch normally.


Eureka! Thank you, @EntilZha and @penguinshin, for your priceless advice. It works!

Thank you.
After rewriting the original Dockerfile for CUDA 9 and adding a line to download and install nccl-dev etc., the build itself finished successfully. However, when I start a container and run nvidia-docker run --rm -it pytorch-cuda9 nvidia-smi, an error occurs.

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Do I need to add or do something else? Thanks in advance.

This issue may be related: https://github.com/NVIDIA/nvidia-docker/issues/155#issuecomment-236443215. You can verify that it is not a PyTorch-related problem by trying to run a base CUDA 9 container.
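
For example:

nvidia-docker run --rm -it nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 nvidia-smi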


Thank you. Executing

ln -s /usr/local/nvidia/lib64/libnvidia-ml.so.3xx.xx /lib/x86_64-linux-gnu/libnvidia-ml.so
ldconfig

solved the problem.

nvidia-docker run --rm -it nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 nvidia-smi works fine.
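
And for anyone following along, a quick way to confirm that PyTorch itself sees the GPU inside the container (pytorch-cuda9 is the image tag from my build above) is:

nvidia-docker run --rm -it pytorch-cuda9 python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"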