Questions about PyTorch Performance & Non-CUDA Acceleration & AVX-512

Can anyone point me to recent performance profiling numbers for PyTorch training, e.g. which modules take up the most CPU/GPU time during training? I am assuming that is autograd.

Or, more to the point: I am interested in implementing non-CUDA acceleration for training in PyTorch. Could that be accomplished with just a custom activation and forward/backward function?

Also, has there been any work towards AVX-512 CPU acceleration with PyTorch?

Thanks in advance


A fair amount of CPU acceleration is already implemented in PyTorch for low-level functions, in particular if you use a high-quality BLAS library. If your custom function uses those low-level functions, it will already benefit.
If you need this for a new function that you implement yourself, you just need to implement it using these CPU extensions, with the proper C flags to make sure it still compiles on CPUs that do not support them. All currently implemented code can be found in this folder. You can also find the dispatcher function detectHostSIMDExtensions for the different architectures.
For AVX-512 in particular (which is not used at the moment), you can look at this issue.
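
On the custom forward/backward part of the question, here is a minimal sketch of what that could look like with torch.autograd.Function. The helpers offload_forward and offload_backward are hypothetical placeholders for whatever non-CUDA backend you have in mind (here they just run locally), and the squaring op is only chosen because its gradient is easy to check.

```python
import torch


def offload_forward(x):
    # Hypothetical stand-in for a call into a non-CUDA backend
    # (SIMD library, another host, ...). Here it simply runs locally.
    return x * x


def offload_backward(x, grad_output):
    # Hypothetical stand-in for the matching backward kernel: d(x^2)/dx = 2x.
    return 2.0 * x * grad_output


class OffloadedSquare(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # keep the input for the backward pass
        return offload_forward(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return offload_backward(x, grad_output)


x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
y = OffloadedSquare.apply(x)
y.sum().backward()  # autograd calls our backward() here

# Compare the hand-written backward against numerical gradients.
gradcheck_ok = torch.autograd.gradcheck(OffloadedSquare.apply, (x,))
print("gradcheck passed:", gradcheck_ok)
```

Autograd only needs the two static methods, so in principle anything can sit behind them. Whether it is fast enough is a separate question.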

So is it safe to say autograd is the computationally expensive aspect of training? What if I wanted to ship a custom forward/reverse function to a different host or cluster for offloading, instead of CUDA; is that a reasonable architecture for experimenting with non-GPU acceleration?

I have been looking at GPU performance with PyTorch on my GTX1060 here, and it doesn’t seem like PyTorch is achieving high GPU occupancy. Or at least my numbers are showing less than ~35% GPU utilization, and I am able to train two or three models at once before I get close to 95%+. So I am wondering if I would be able to get close to GPU acceleration with MPI for example, if I had a cluster to offload a custom forward/backwards function to over the network.


What is expensive depends a lot on what you are running.

  • If you perform operations on big inputs (like images or larger tensors) or with big batch sizes, then you can expect the runtime to be roughly 40% forward and 60% backward (for operations commonly used in NNs, like conv/linear/relu…).
  • If you perform small operations most of the time, then the runtime is quite hard to predict, as it depends a lot on which operations you’re doing. If your total forward pass is very fast (on the order of microseconds), then you will start to see the overhead of the autograd engine and the Python bindings. And if you use a GPU, you will have a noticeable overhead from just launching the kernels on top of the rest.
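
A rough way to check the forward/backward split on your own setup is to time the two phases separately. This is only a sketch: the small conv net and the input sizes are made up for illustration, and it runs on CPU. On a GPU you would need to call torch.cuda.synchronize() before each timestamp, since kernels are launched asynchronously.

```python
import time

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small conv net and batch; sizes chosen only for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)
images = torch.randn(32, 3, 32, 32)


def timed_step():
    """Run one forward/backward pass and return (forward_s, backward_s)."""
    model.zero_grad()
    t0 = time.perf_counter()
    loss = model(images).sum()
    t1 = time.perf_counter()
    loss.backward()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


timed_step()  # warm-up, not counted
fwd, bwd = zip(*(timed_step() for _ in range(5)))
forward_time, backward_time = sum(fwd), sum(bwd)
total = forward_time + backward_time
print(f"forward {100 * forward_time / total:.0f}%  "
      f"backward {100 * backward_time / total:.0f}%")
```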

If the GPU usage is around 35%, then I guess you’re either performing many small operations or you have a lot of other stuff happening in your code on top of just the forward/backward passes.
Which one is it? Could you describe in more detail what kind of things you’re running?
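
If you are not sure, the autograd profiler can give a per-operator breakdown. A minimal sketch on CPU, with a made-up tiny model and a batch of one so that the run is dominated by small operations:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model; a batch of one means lots of very small ops.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
x = torch.randn(1, 64)

with torch.autograd.profiler.profile() as prof:
    for _ in range(20):
        model.zero_grad()
        model(x).sum().backward()

# Aggregate events by operator name and show the most expensive ones.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

If the table is a long list of operators that each take very little time, you are most likely in the "many small operations" case.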

Right now I’ve been experimenting with a couple of different GAN and VAE models. The GANs show roughly twice the GPU utilization of the VAEs, which I assume is due to the dual generator and discriminator networks, but even then the models rarely peak beyond 45% or so.

I haven’t spent a lot of time on the performance profiling aspect of this; I just assumed that something as computationally expensive as a GAN would utilize more GPU resources than what I am seeing.

The main problem is that if you forward a single image at a time, the process is very sequential and each step is a rather small operation. So you are stuck doing a bunch of small operations one after the other, even though the total model is relatively big.
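
To make that concrete, here is a small sketch that pushes the same inputs through a model one at a time and then as a single batch. The model and the sizes are made up for illustration, and it runs on CPU. The outputs should match up to floating point error, while the batched call does the work in far fewer, larger operations.

```python
import time

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model and inputs ("images" already flattened to 256 features).
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
images = torch.randn(64, 256)

with torch.no_grad():
    t0 = time.perf_counter()
    # One forward pass per image: 64 sequences of small operations.
    one_by_one = torch.cat([model(img.unsqueeze(0)) for img in images])
    t1 = time.perf_counter()
    # One forward pass for the whole batch: the same work in a few large ops.
    batched = model(images)
    t2 = time.perf_counter()

print(f"one at a time: {t1 - t0:.4f}s, batched: {t2 - t1:.4f}s")
same = torch.allclose(one_by_one, batched, atol=1e-4)
print("same outputs:", same)
```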