GPU inference with high CPU usage?

Hi all,

After copying my model to the GPU with network->to(device) (network is derived from torch::nn::Module), I get a huge performance boost, as expected (both cuda::is_available() and cuda::cudnn_is_available() return true). However, when I test the performance by simply putting a loop around the call to network->forward to maximize the number of calls per second, GPU utilization only reaches 60 percent while CPU usage is about 20 percent (!). GPU usage is clearly limited by the CPU doing some unknown work.

Expected result (tested with other backends): GPU utilization reaches 95-100 percent and CPU usage stays low.

Testing on Windows with the latest libtorch build.

The behaviour is easily reproducible. Does anyone have an idea what is going on here? Not being able to fully utilize the GPU is a problem, because performance is lower than it could be. My model mostly contains Conv2d and BatchNorm layers.
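For anyone wanting to reproduce the measurement, here is a minimal sketch of a throughput loop of the kind described above. It uses only the C++ standard library; the names measure_calls_per_second, step, iterations and warmup are hypothetical and not part of libtorch. The step callable stands in for one inference call (for example a lambda wrapping network->forward). Since CUDA kernel launches are asynchronous with respect to the host, the callable would typically also need to wait for the device before returning, otherwise the loop may mostly time how fast work can be queued.

```cpp
#include <chrono>

// Runs `step` repeatedly and returns the measured calls per second.
// `step` is a hypothetical stand-in for one inference call; it is assumed
// to return only once the work it started has actually finished.
template <typename Step>
double measure_calls_per_second(Step&& step, int iterations, int warmup) {
    // Warm-up calls are executed but not timed, so one-time setup costs
    // do not distort the result.
    for (int i = 0; i < warmup; ++i) {
        step();
    }

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        step();
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    // Guard against a zero interval on clocks with coarse resolution.
    return seconds > 0.0 ? iterations / seconds : 0.0;
}
```

Comparing the returned rate with the CPU load shown by the system monitor while the loop runs is one way to check whether a single core is saturated.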



It could be that your data loading is the bottleneck. Try using the dataloader with multiple workers, so that the dataset doesn’t get loaded in the main process and the next batch sits ready for the next forward pass on the GPU.

My model mostly contains Conv2d and BatchNorm layers.

Not sure, but are you perhaps doing the convolutions on the CPU? That would be far less efficient than on the GPU, because the convolution algorithms on CPU and GPU are implemented very differently.

Well, not that simple ;). What I found out: my model consists of about 200 layers. When running on the GPU, execution of the individual layers is extremely fast, so the CPU overhead of the “forward” calls (all the Sequentials in my model) becomes prominent and drives up the CPU usage! So when I run the forward() method of my module in a loop to test maximum throughput, I cannot fully utilize the GPU for this reason: GPU load goes up to 60-70 percent, but the loop consumes an entire CPU core, so it cannot run the model any faster…
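The reasoning above can be illustrated with a rough back-of-the-envelope model. This is only a sketch under simplifying assumptions: every layer costs the same fixed host time to dispatch and the same fixed GPU time to execute, the host can queue work ahead of the GPU, and synchronization points are ignored. The names PipelineEstimate and estimate_pipeline are made up for this illustration.

```cpp
#include <algorithm>

// Result of the simplified model.
struct PipelineEstimate {
    double iteration_us;     // estimated time per forward pass, in microseconds
    double gpu_utilization;  // fraction of that time the GPU is busy
    bool host_bound;         // true if the CPU dispatch cost is the limit
};

// Assumes host dispatch and GPU execution overlap, so the slower of the
// two determines how long one forward pass takes.
PipelineEstimate estimate_pipeline(int layers,
                                   double host_us_per_layer,
                                   double gpu_us_per_layer) {
    const double host_us = layers * host_us_per_layer;
    const double gpu_us = layers * gpu_us_per_layer;
    const double iteration_us = std::max(host_us, gpu_us);

    PipelineEstimate estimate;
    estimate.iteration_us = iteration_us;
    estimate.gpu_utilization =
        iteration_us > 0.0 ? gpu_us / iteration_us : 0.0;
    estimate.host_bound = host_us > gpu_us;
    return estimate;
}
```

With purely illustrative numbers (not measurements), estimate_pipeline(200, 10.0, 6.5) gives an iteration time of 2000 microseconds, a GPU utilization of about 0.65, and host_bound set to true. In this model the GPU can only reach full utilization once the per-layer host cost drops below the per-layer GPU cost.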

Has anyone experienced this when running a model in eval() mode on the GPU?


So, after reviewing some ongoing work (like this, for example), it seems to me that with some models it might be really problematic to fully utilize the GPU, because the CPU just gets maxed out before the GPU can be fully utilized…

Can anybody confirm or comment?



I have only tried on the CPU, so I’m not sure if it will help,
but have you checked whether openmp is disabled?
In my case it used up all CPU resources when openmp was enabled, but less than 1 percent when it was disabled.

Well, I am running on GPU, so openmp should be irrelevant in my case (hopefully).