PyTorch on V100 GPU

I’ve submitted a PR with a CUDA 9 Dockerfile: https://github.com/pytorch/pytorch/pull/3445

The tip in the link you shared is helpful, though it is still inconvenient to execute these commands every time. Is there any way to run them automatically? Thank you.

Hi @ngimel, is it possible to summarize the exact steps to follow for a base Ubuntu 16.04 AWS image?

All the talk of Docker, custom deep learning images, etc. makes me worry that installation on a base 16.04 instance doesn’t work, or needs some tricky hacking?

@hughperkins, installation on a base image should work; starting with the AWS DL CUDA 9 image just saves you a few steps (installing the driver + CUDA 9 toolkit + cuDNN v7 + NCCL). @penguinshin has a good summary for starting with the CUDA 9 DL AMI.

If you are looking to run on AWS bare metal, take a look at the packer.io script linked in my prior post: https://github.com/Pinafore/qb/blob/master/packer/packer_gpu.json

If you are unfamiliar with how Packer works, you can check packer.io, but in short:

  1. Executing packer build packer_gpu.json will run the JSON file
  2. This will create a new instance on EC2 based on the base Ubuntu image, according to the spec at the top of the file
  3. It will then execute each script in order on the instance
  4. Finally, it will snapshot the instance and create an AMI; for example, the most recent one I built is public with the ID ami-84e12afc

If you just want the instructions for the DL stuff, look at the scripts in bin/install-* covering cuda, python, and dl-libs.

I think that installing things is easier on base Ubuntu, but for deployment purposes Docker could be desirable. Since I’m working on research with no real deployment, it’s not that important to me.

So, I got PyTorch running on the V100, but it seems that building from master was not ideal; lots of … evolutions … in progress:

  File "ecn.py", line 501, in run
    action.reinforce(alive_rewards[:, agent].contiguous().view(_batch_size, 1))
  File "/awsnas/hugh/conda_cuda9/lib/python3.6/site-packages/torch/autograd/variable.py", line 218, in reinforce
    if not isinstance(self.grad_fn, StochasticFunction):
NameError: name 'StochasticFunction' is not defined

I will try building from a release branch/tag instead.

Building from v0.2.0 gives me an error on clone:

git clone --recursive https://github.com/pytorch/pytorch -b v0.2.0 pytorch_cuda9_20
Cloning into 'pytorch_cuda9_20'...
remote: Counting objects: 44650, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 44650 (delta 0), reused 1 (delta 0), pack-reused 44646
Receiving objects: 100% (44650/44650), 16.92 MiB | 26.65 MiB/s, done.
Resolving deltas: 100% (33736/33736), done.
Checking connectivity... done.
Checking out files: 100% (1514/1514), done.
fatal: no submodule mapping found in .gitmodules for path 'torch/lib/gloo/third-party/googletest'

I didn’t find a particular branch/tag to use, but arbitrarily picked commit 66d24c5, which seems to work fine.

I don’t think v0.2.0 has CUDA 9 support, but I could be way wrong.

The 0.2 branch does not contain all the necessary changes for CUDA 9/Volta. You are right, there have been a lot of changes lately, some of them related to reinforcement learning; ask the reinforcement learning people which commit is usable.
In particular, some StochasticFunctions were removed in this PR: https://github.com/pytorch/pytorch/pull/3165
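
For reference, the replacement for the old action.reinforce pattern is to use torch.distributions and build the surrogate loss explicitly. A rough sketch, assuming a recent build with the distributions API (the shapes and reward tensor below are placeholders, not code from this thread):

import torch
from torch.distributions import Categorical

# Sketch of the torch.distributions-based replacement for action.reinforce(reward).
# Shapes and rewards are illustrative only.
logits = torch.randn(8, 4, requires_grad=True)   # batch of 8, 4 possible actions
dist = Categorical(logits=logits)
action = dist.sample()                           # sampled actions, shape (8,)
reward = torch.randn(8)                          # placeholder per-sample rewards

# REINFORCE-style surrogate loss: -log_prob(action) * reward
loss = -(dist.log_prob(action) * reward).mean()
loss.backward()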

“Merged 15 days ago”. Reversing the constraints, what’s the earliest commit that supports CUDA 9?

Definitely earlier than 15 days ago :slight_smile: But there were (and still are) many commits in flight enabling different parts; e.g., the commit to use tensor ops in cuDNN RNNs went in just a few days back.

Hello,

Are there any updates about this? What is the easiest way to make PyTorch work on the V100?

Now that 0.3 is out, it should “just work”.

cudnn and cublas use tensor cores, so it should be accelerated automatically.

Actually, I might be wrong. CUBLAS_TENSOR_OP_MATH seems to only be used in THCudaBlas_Hgemm, which means it’s only using the tensor cores when doing half-precision GEMM. There is a similar issue for convolutions:

if(dataType == CUDNN_DATA_HALF)
  CUDNN_CHECK(cudnnSetConvolutionMathType(desc, CUDNN_TENSOR_OP_MATH));

@Soumith_Chintala @apaszke, is my assessment here correct? Sorry if I missed something (quite likely!). If my assessment is right, will we see tensor cores used for fp32 in the future, or is that problematic for some reason?
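
If my reading is right, only these half-precision calls would be eligible from the Python side; a rough sketch (sizes are arbitrary, and whether tensor cores actually fire is still up to the cuBLAS/cuDNN heuristics):

import torch

# Half-precision GEMM and convolution: the two paths where tensor-op math
# appears to be requested, per the snippets above. Assumes a CUDA device;
# sizes are arbitrary.
a = torch.randn(1024, 1024).cuda().half()
b = torch.randn(1024, 1024).cuda().half()
c = torch.mm(a, b)                              # fp16 GEMM (THCudaBlas_Hgemm path)

conv = torch.nn.Conv2d(64, 64, 3, padding=1).cuda().half()
x = torch.randn(8, 64, 32, 32).cuda().half()
y = conv(x)                                     # fp16 cuDNN convolution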

Don’t tensor cores require half inputs (they do fp32-like math, but all transfers are in fp16)?

cuDNN must be half precision, but GEMM can be single precision, according to rule 4 of https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/

Also, while I have your attention: I haven’t been successful in finding any sample code showing training in half precision in PyTorch. Do you have any examples or tips you could provide? I’m aware of the NVIDIA tips post, but guidance on converting this to PyTorch would be really helpful.

(With the AWS P3 instance and now Titan V GPU, this is becoming more relevant to me and my students.)

Though cuBLAS supports Tensor Core execution for fp32 inputs/outputs, this path is not enabled in PyTorch due to concerns about accuracy: the way it is implemented, fp32 data is rounded to fp16 on load, so precision is inherently lower than with full fp32 compute. cuDNN does not support Tensor Cores for fp32 inputs/outputs.
https://github.com/pytorch/examples/pull/203/commits has examples of fp16 training for ImageNet and a word language model. The ImageNet example requires some changes to PyTorch core, but the word language model one should work. cc @csarofeen
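
Roughly, the recipe those examples follow is: run the model and activations in fp16, keep an fp32 master copy of the weights for the optimizer, and scale the loss so small gradients don’t underflow. A rough sketch of that pattern (the model, data, and loss-scale value below are placeholders, not the code from that PR):

import torch

# Sketch of the common fp16 training recipe, not the actual code in the linked
# examples. Model, data, and the loss-scale value are placeholders.
model = torch.nn.Linear(128, 10).cuda().half()          # fp16 model + activations
master_params = [p.detach().clone().float() for p in model.parameters()]
for mp in master_params:
    mp.requires_grad = True
optimizer = torch.optim.SGD(master_params, lr=0.01)     # optimizer steps in fp32
loss_scale = 128.0                                      # static loss scaling

x = torch.randn(32, 128).cuda().half()
target = torch.randint(0, 10, (32,)).cuda()

out = model(x)
loss = torch.nn.functional.cross_entropy(out.float(), target)
(loss * loss_scale).backward()

# copy unscaled fp16 grads into the fp32 master params, step, then copy back
for mp, p in zip(master_params, model.parameters()):
    mp.grad = p.grad.detach().float() / loss_scale
optimizer.step()
with torch.no_grad():
    for mp, p in zip(master_params, model.parameters()):
        p.copy_(mp.half())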

Thanks @ngimel, that’s extremely helpful. Do you know if there are any plans to incorporate some of that into core (e.g. the BN fixes and the fp16utils fixes)? If not, I’ll try to incorporate it directly into fastai.

The BN fix for master is 2 lines (update: just merged): https://github.com/pytorch/pytorch/pull/4021. With it you can use the standard nn.BatchNorm2d; you just have to make sure that you don’t convert it to .half() (if you did convert it to .half(), upconvert it back to .float()). For 0.3, before the cuDNN migration to ATen, the fix was more annoying; there’s a closed PR by @csarofeen to the core, and he has example branches for both 0.3 and master in his examples fork: https://github.com/csarofeen/examples/. The rest of the fp16utils is not going into core in the near future.
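
In practice, “don’t convert BN to .half()” can look something like this; the helper below is just for illustration, not a PyTorch API:

import torch.nn as nn

def half_except_bn(model):
    # Convert everything to fp16, then upconvert BatchNorm layers back to fp32,
    # as suggested above. Illustrative helper, not part of PyTorch.
    model.half()
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.float()
    return model

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
).cuda()                      # assumes a CUDA device
model = half_except_bn(model)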
