While profiling PyTorch 1.9.0, we found that it relies on a large amount of C++ code and third-party libraries. As of 2021, the codebase consists of roughly $5.6M$ lines of code, which is quite big.
One setup we used was a single GPU, training a (near) standard ResNet-style model on CIFAR-10. Profiling this run with NVVP, we observed an excessive number of cudaGetDevice() calls: 6,115,338 calls in total during training. These calls consumed 5 of the 110 seconds of training time, i.e. roughly 5% of the run.
There are many places within PyTorch where this function is called. Instead of querying the runtime each time, it is better to maintain a variable that tracks which GPU is currently active.
We realize that in complicated software such polling may have its place. Of course, if you have full control over device switching, it is better not to poll at all.
But in any case, when a single GPU is used, all cudaGetDevice() calls are a waste of time: they are unnecessary, because all library contexts, all memory, and all streams are associated with that one device.