After copying my model to the GPU with `network->to(device)` (network is derived from `torch::nn::Module`), I get a huge (as expected) performance boost; both `cuda::is_available()` and `cuda::cudnn_is_available()` return true. However, when I try to test the performance by simply putting a loop around the `network->forward` call to maximize the number of calls per second, I can see that GPU utilization only reaches about 60 percent while CPU usage is around 20 percent (!). GPU usage is clearly being limited by the CPU doing some unknown work.
Expected result (tested with other backends): GPU utilization goes up to 95-100 percent and CPU usage stays low.
Testing on Windows with the latest libtorch build.
The behaviour is easily reproducible. Does anyone have an idea what is going on here? Not being able to fully utilize the GPU is a problem, because performance is lower than it could be. My model mostly consists of Conv2d and BatchNorm layers.
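For reference, a minimal sketch of the kind of benchmark loop I mean is below. The model, input shape, and iteration count are placeholders (my actual network is not shown); I include `eval()`, `torch::NoGradGuard`, and `torch::cuda::synchronize()` here, on the assumption that inference-only timing is intended, since omitting them can itself add CPU-side overhead:

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
    torch::Device device(torch::kCUDA);

    // Placeholder model: Conv2d + BatchNorm, as in the real network.
    torch::nn::Sequential network(
        torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 64, 3).padding(1)),
        torch::nn::BatchNorm2d(64));
    network->to(device);
    network->eval();             // inference-mode BatchNorm behaviour
    torch::NoGradGuard no_grad;  // don't build the autograd graph

    auto input = torch::randn({1, 3, 224, 224}, device);

    const int iters = 1000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        auto out = network->forward(input);
    }
    torch::cuda::synchronize();  // wait for queued kernels before stopping the clock
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    std::cout << iters / elapsed.count() << " forward calls/s\n";
}
```

Even with this setup the CPU-bound behaviour appears, which is what makes me suspect something in the CPU-side dispatch rather than my own code.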