CPU to GPU basics

A few beginner-level questions to help move from CPU to GPU. I’ve searched previous responses here but couldn’t find specifics.

I have my code up and running on my local GPU (only one device). For any other beginners running across this post: you need to move your Variables (target.cuda()), network (decoder.cuda()) and criterion (criterion.cuda()) to the GPU, and CUDA obviously needs to be available on your system: a physical GPU, the NVIDIA drivers and the CUDA packages.
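For anyone who wants to see those steps in one place, here is a minimal sketch. It assumes a recent PyTorch version, where `.to(device)` does the same job as `.cuda()` and plain tensors have replaced Variables. The tiny `nn.RNN` and `nn.MSELoss` are hypothetical stand-ins for my actual decoder and criterion, and the shapes are made up.

```python
import torch
import torch.nn as nn

# Fall back to the CPU so the sketch still runs on a machine without CUDA.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the `decoder` and `criterion` mentioned above.
decoder = nn.RNN(input_size=8, hidden_size=16, batch_first=True).to(device)
criterion = nn.MSELoss().to(device)

# Inputs and targets must live on the same device as the model.
inputs = torch.randn(4, 10, 8, device=device)    # (batch, seq_len, features)
target = torch.randn(4, 10, 16, device=device)   # (batch, seq_len, hidden)

output, hidden = decoder(inputs)
loss = criterion(output, target)
loss.backward()
```

If a tensor is left on the CPU while the model is on the GPU, PyTorch raises a device mismatch error rather than moving it silently.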

I want to spin up a small GPU cluster and run my RNN there, but I have a few questions:

  1. Do RNNs benefit from GPUs?

  2. Will code that runs properly on my local GPU run out-of-the-box on a GPU cluster? If not, what do I need to be thinking about?

  3. Do GPUs help if I’m using a batch of size 1? Or are batches “good”?

  4. Do I have to manually allocate / transfer or otherwise keep track of which tensor and other objects go to which device? Or does CUDA/PyTorch figure this out automatically?

  5. Do I have to gather anything at the end of the computation? (I’m coming from the Spark world, where it’s a thing sometimes.)

  6. For small, simpler models (like the one I’m running) CPU and GPU times will be very similar. If I take this model to a GPU cluster, will I see any improvement? Is the efficiency gain proportional only to model complexity? Or will the simple model run faster the more nodes in my cluster?

Many questions! Feel free to answer one only.


About your questions:

  1. Yes, RNNs can benefit from optimized GPU implementations, and PyTorch wraps cuDNN, which gives even further speedups.
  2. Yes, it will run out of the box, but only on one GPU. If you want to parallelize over multiple GPUs, check http://pytorch.org/docs/master/nn.html#torch.nn.DataParallel for a simple way to distribute computations batch-wise over multiple GPUs.
  3. GPUs shine compared to CPUs for larger batch sizes.
  4. If you use nn.DataParallel, everything is handled for you automatically. But you might want to use multiple GPUs in a different way (for example, parts of one model on GPU 1 and other parts on GPU 2), in which case you need to ship the different parts to the different GPUs yourself (via result.cuda(gpu_idx)).
  5. nn.DataParallel already gathers the results from the multiple GPUs for you.
  6. For small models, you won’t see any benefits from using GPUs over CPUs, and it won’t improve if you use multiple GPUs. You will need to increase the model size to start to see improvements, because there is some communication overhead in transmitting data between GPUs. Also, there are a number of tricks that are used for improving multi-GPU usage, see https://arxiv.org/abs/1404.5997 for example.
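
To make points 2, 4 and 5 more concrete, here is a rough sketch of both approaches. Everything in it is illustrative: the layer sizes, the `TwoStage` class and the `dev0`/`dev1` names are made up for the example, and it uses plain linear layers rather than an RNN to keep it short. On a machine with fewer than two GPUs it falls back to a single device, so it only shows the structure, not a speedup.

```python
import torch
import torch.nn as nn

n_gpus = torch.cuda.device_count()
dev0 = torch.device("cuda:0") if n_gpus > 0 else torch.device("cpu")
dev1 = torch.device("cuda:1") if n_gpus > 1 else dev0

batch = torch.randn(32, 8, device=dev0)  # made-up (batch, features) input

# Approach A: nn.DataParallel splits the batch across the visible GPUs
# and gathers the outputs back onto the first device.
model = nn.Linear(8, 2).to(dev0)
if n_gpus > 1:
    model = nn.DataParallel(model)
out = model(batch)

# Approach B: place parts of one model on different devices by hand
# and move the intermediate activations between them yourself.
class TwoStage(nn.Module):  # hypothetical example model
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(8, 16).to(dev0)
        self.stage2 = nn.Linear(16, 2).to(dev1)

    def forward(self, x):
        h = torch.relu(self.stage1(x.to(dev0)))
        return self.stage2(h.to(dev1))  # result lives on dev1

split_out = TwoStage()(batch)
```

Approach A needs no bookkeeping beyond the wrapper, while in approach B every transfer between devices is explicit in `forward`.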

Hope this helps!


Fantastic answer, thanks for the pointers.