GPU-efficient PyTorch code

I am using PyTorch to implement an algorithm that is not related to neural networks or ML. Speed is paramount, so I intend to run it on a GPU.
Are there any general guidelines for writing time-efficient PyTorch code that will best utilize the GPU's processing power and parallelism?


If you're working only in Python, the main idea is to perform as few ops as possible, making each of them as big as possible. The GPU is great at speeding up a single large op, but it is very bad at switching between ops and very, very bad at executing many small ops.
So make sure everything is stored in tensors that are as large as possible, and always operate on the full tensor at once.
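As a minimal sketch of this point (function names are my own), compare looping over a tensor element by element, which launches one tiny kernel per element, with a single vectorized op over the whole tensor:

```python
import torch

def scale_loop(x):
    # Anti-pattern: one tiny op per element, dominated by launch overhead on GPU
    out = torch.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * 2.0
    return out

def scale_vectorized(x):
    # One large op over the full tensor; the GPU parallelizes it internally
    return x * 2.0

# Falls back to CPU when no GPU is available, so the sketch stays runnable
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(10_000, device=device)
assert torch.allclose(scale_loop(x[:100]), scale_vectorized(x[:100]))
```

Both functions compute the same result; the difference only shows up in wall-clock time on a GPU, where the loop version can be orders of magnitude slower.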

Do you have particular code in mind?

From my own experience (in addition to the advice above):

  1. Try to map each operation to a matrix multiplication as much as possible.
  2. Avoid `gather` and `select`.
  3. Don’t let the GPU wait for CPU batch preparation; make sure batch data is prepared beforehand or in a background process. Keep the GPU busy all the time. `nvidia-smi` is your friend for checking GPU utilization.
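To illustrate point 1 (a sketch with my own function names, not code from the thread): an operation like "sum selected groups of rows" can be expressed as one big matmul with a 0/1 indicator matrix, instead of looping over groups with indexing ops:

```python
import torch

def group_sums_loop(x, groups):
    # One small indexing op per group: many tiny kernels on GPU
    return torch.stack([x[idx].sum(dim=0) for idx in groups])

def group_sums_matmul(x, groups, n_rows):
    # Build a 0/1 indicator matrix once, then do a single large matmul
    ind = torch.zeros(len(groups), n_rows, dtype=x.dtype, device=x.device)
    for g, idx in enumerate(groups):
        ind[g, idx] = 1.0
    # (n_groups, n_rows) @ (n_rows, d) -> (n_groups, d)
    return ind @ x

x = torch.arange(12, dtype=torch.float32).reshape(4, 3)
groups = [torch.tensor([0, 2]), torch.tensor([1, 3])]
a = group_sums_loop(x, groups)
b = group_sums_matmul(x, groups, n_rows=4)
assert torch.allclose(a, b)
```

For point 3, `torch.utils.data.DataLoader` with `num_workers > 0` is the usual way to prepare batches in background processes so the GPU never sits idle.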