Feedback about PyTorch profiling - too many cudaGetDevice() calls

While profiling PyTorch 1.9.0 we realized just how much C++ code and how many third-party libraries it uses. As of 2021 it consists of about 5.6 million lines of code, which is pretty big.

Our setup used a single GPU, and while training a fairly standard ResNet-like model on CIFAR-10 under NVVP we observed far too many cudaGetDevice calls. Over the course of training, cudaGetDevice() was called 6,115,338 times in total, which consumed about 5 seconds out of the 110 seconds of training, roughly 5% of the time.

There are many places inside PyTorch where this is called. Instead of querying the runtime each time, it would be better to keep track of which GPU is current in a variable that the code controls, for example as sketched below.
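For illustration, here is a minimal C++ sketch of that idea. The function names are made up and this is not how PyTorch is actually structured internally; it only shows the caching pattern we have in mind:

```cpp
// Illustrative sketch only (not PyTorch's actual implementation): keep the
// current device in a thread_local variable so hot paths read a cached int
// instead of calling into the CUDA runtime every time.
#include <cuda_runtime.h>

namespace {
thread_local int cached_device = -1;  // -1 means "not queried yet"
}

int current_device() {
  if (cached_device < 0) {
    cudaGetDevice(&cached_device);    // hit the runtime once per thread
  }
  return cached_device;
}

void set_current_device(int device) {
  if (device != cached_device) {
    cudaSetDevice(device);            // only switch when it actually changes
    cached_device = device;
  }
}
```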

We realize that it’s in complicated software such as pooling may have a place to be. Of course, if you have control it’s better not to poll it.

But in any case, when a single GPU is used, all cudaGetDevice() calls are a waste of time and are not needed, because all library contexts, all memory, and all streams should be associated with that one device; a sketch of this fast path follows below.
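As a sketch of that single-GPU argument (again with illustrative names, not actual PyTorch code), the device count could be queried once and device switching skipped whenever only one GPU is visible:

```cpp
// Sketch under the single-GPU assumption: query the device count once, and
// make device switching a no-op when only one GPU is visible.
#include <cuda_runtime.h>

static int visible_device_count() {
  static int count = [] {
    int n = 0;
    cudaGetDeviceCount(&n);  // one runtime call at first use
    return n;
  }();
  return count;
}

void maybe_set_device(int device) {
  if (visible_device_count() <= 1) {
    return;  // single GPU: all contexts, memory, and streams live on device 0
  }
  cudaSetDevice(device);
}
```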

Hi!

With the help of the one and only @ptrblck: this has come up before, and it seems that it is mostly a profiling artefact:

Of course, you could re-run the experiment, replacing all calls to cudaGetDevice/cudaSetDevice with a no-op, and see whether things change; a rough sketch of such no-op stand-ins is below.
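For what it's worth, the no-op stand-ins for that experiment could look roughly like this (hypothetical names; you would patch whatever wrapper the calls go through and rebuild, then compare the wall-clock time of a training run):

```cpp
// Hypothetical no-op stand-ins for the experiment: report a fixed device and
// skip the CUDA runtime entirely, valid only under the single-GPU assumption.
#include <cuda_runtime.h>

cudaError_t noop_get_device(int* device) {
  *device = 0;         // single GPU: always claim device 0
  return cudaSuccess;  // never enter the CUDA runtime
}

cudaError_t noop_set_device(int /*device*/) {
  return cudaSuccess;  // switching is meaningless with one GPU
}
```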

Best regards

Thomas
