Feedback about PyTorch profiling - too many cudaGetDevice() calls

Konstantin_Burlachen · December 1, 2021, 1:28pm

While profiling PyTorch 1.9.0, we realized how much C++ code and how many third-party libraries it uses. As of 2021, the codebase consists of 5.6M lines of code, which is pretty big.

One setup we used was a single GPU. While training a (near) standard ResNet-like model on CIFAR-10 under NVVP, we observed far too many cudaGetDevice calls: 6,115,338 calls to cudaGetDevice() in total during training. They take 5 seconds of the 110 seconds of training, which is about 5% of the time.
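As a quick sanity check on the figures quoted above, the arithmetic can be written out. This only restates the numbers from the post; the variable names are ours. It works out to roughly 0.8 microseconds per call and about 4.5% of the run.

```python
# Figures quoted in the post above.
calls = 6_115_338        # cudaGetDevice() calls reported by NVVP
api_seconds = 5.0        # time attributed to those calls
total_seconds = 110.0    # total training time

# Average cost of one call, in microseconds.
per_call_us = api_seconds / calls * 1e6

# Fraction of the whole run spent in these calls.
share = api_seconds / total_seconds

print(f"average cost per call: {per_call_us:.2f} us")
print(f"share of training time: {share:.1%}")
```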

There are places within PyTorch where this function is called. Instead of calling it, it would be better to keep a variable that tracks which GPU is current.

We realize that in complicated software such polling may have its place. Of course, if you have control over the device state, it is better not to poll it.

But in any case, when using a single GPU, all cudaGetDevice() calls are a waste of time and are not needed, because all library contexts, all memory, and all streams should be associated with that single device.
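To make the idea of tracking the current GPU in a variable concrete, here is a minimal sketch in plain Python. It is not how PyTorch is implemented: `CachedDevice` and `FakeRuntime` are hypothetical names, and `FakeRuntime` merely stands in for the CUDA runtime so that the number of queries can be counted without a GPU.

```python
class CachedDevice:
    """Hypothetical helper: remembers the current device instead of polling."""

    def __init__(self, query_fn, set_fn):
        self._query_fn = query_fn  # stand-in for cudaGetDevice()
        self._set_fn = set_fn      # stand-in for cudaSetDevice()
        self._current = None       # unknown until the first query

    def get(self):
        # Ask the runtime only once, then answer from the cached value.
        if self._current is None:
            self._current = self._query_fn()
        return self._current

    def set(self, device):
        # Skip the runtime call when the device would not change.
        if device != self.get():
            self._set_fn(device)
            self._current = device


class FakeRuntime:
    """Stand-in for the CUDA runtime that counts how often it is called."""

    def __init__(self):
        self.device = 0
        self.get_calls = 0
        self.set_calls = 0

    def get_device(self):
        self.get_calls += 1
        return self.device

    def set_device(self, device):
        self.set_calls += 1
        self.device = device


runtime = FakeRuntime()
cache = CachedDevice(runtime.get_device, runtime.set_device)

# Simulate 1000 operations that each check and "switch to" device 0.
for _ in range(1000):
    cache.get()
    cache.set(0)

print(runtime.get_calls, runtime.set_calls)  # 1 0
```

The weak point of such a cache is that it goes stale as soon as any other code, for example a third-party library, changes the device through the runtime directly. That is one plausible reason why polling persists in a large codebase.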

tom · December 2, 2021, 9:50am

Hi!

With the help of the one and only @ptrblck : this has come up before and it seems that it is mostly a profiling artefact:

github.com/pytorch/pytorch: Excessive call to cudaGetDevice and cudaSetDevice

opened 12:27AM - 15 Mar 19 UTC by omry, closed 07:46PM - 18 Mar 19 UTC

## 🐛 Bug

While profiling my code, I noticed millions of calls to cudaGetDevice which were slowing down my code significantly (it's an RL use case and the model I am using is small). I have a minimal repro that demonstrate that a single forward call on nn.Linear() is triggering 11 cudaGetDevice() calls and 3 cudaSetDevice() calls.

## To Reproduce

See attached code snippet.

Steps to reproduce the behavior:

/usr/local/cuda/bin/nvprof --profile-from-start off python cudagetdevice.py

Output:

```
$ /usr/local/cuda/bin/nvprof --profile-from-start off python cudagetdevice.py
==112656== NVPROF is profiling process 112656, command: python cudagetdevice.py
==112656== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
==112656== Profiling application: python cudagetdevice.py
==112656== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   73.95%  5.0880us         1  5.0880us  5.0880us  5.0880us  sgemm_32x32x32_NT_vec
                   26.05%  1.7920us         1  1.7920us  1.7920us  1.7920us  _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00005244_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN75_GLOBAL__N__51_tmpxft_00005244_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a36CopyOpIffE5applyERNS_6TensorERKS6_EUlRfRKfE_ffjLi1ELi2ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T_
      API calls:   99.99%  1.03481s         2  517.41ms  12.000us  1.03480s  cudaLaunchKernel
                    0.01%  67.541us        11  6.1400us  4.5000us  13.500us  cudaGetDevice
                    0.00%  14.583us         3  4.8610us  4.5000us  5.5830us  cudaSetDevice
                    0.00%  9.0410us         2  4.5200us  4.5000us  4.5410us  cudaGetLastError
```

## Expected behavior

Number of calls to cudaGetDevice should be minimal, I would argue that it should not be called at most once, if that.

## Environment

PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Quadro P5000
GPU 1: Quadro P5000
Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.0.1.post2
[pip] torchfile==0.1.0
[pip] torchvision==0.2.2
[conda] blas 1.0 mkl
[conda] mkl 2019.1 144
[conda] mkl_fft 1.0.6 py36hd81dba3_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.0.1 py3.6_cuda10.0.130_cudnn7.4.2_2 pytorch
[conda] torchfile 0.1.0 <pip>
[conda] torchvision 0.2.2 py_3 pytorch

- PyTorch Version (e.g., 1.0): 1.0.1
- OS (e.g., Linux): Ubuntu 16.04
- How you installed PyTorch (`conda`, `pip`, source): pip inside a conda env
- Build command you used (if compiling from source):
- Python version: 3.6
- CUDA/cuDNN version: 10.0
- GPU models and configuration: Quadro P5000

Of course, you could re-run the experiment with all calls to cudaGetDevice/cudaSetDevice replaced by a no-op and see whether the timings change.
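For such a comparison the timings should come from plain wall-clock measurements without a profiler attached, since the profiler itself adds per-call overhead. A minimal sketch of a harness follows. `median_wall_time`, `relative_saving`, `baseline_step` and `patched_step` are hypothetical names, and the two step functions are CPU-only placeholders. In a real run each would execute one training iteration of the respective build and end with `torch.cuda.synchronize()`, so that asynchronously launched kernels are included in the measurement.

```python
import statistics
import time


def median_wall_time(step, iters=20, warmup=3):
    """Median wall-clock seconds per call of `step`, with no profiler attached."""
    for _ in range(warmup):
        step()  # warm-up calls are not timed
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_saving(baseline_step, patched_step, **kwargs):
    """Fraction of baseline time saved by the patched variant (negative if slower)."""
    base = median_wall_time(baseline_step, **kwargs)
    patched = median_wall_time(patched_step, **kwargs)
    return (base - patched) / base


# Placeholder workloads so the sketch runs anywhere; replace them with
# one training step of the unmodified and of the patched build.
def baseline_step():
    sum(i * i for i in range(20_000))


def patched_step():
    sum(i * i for i in range(20_000))


saving = relative_saving(baseline_step, patched_step)
print(f"relative saving: {saving:+.1%}")
```

If the saving measured this way is far below the roughly 5% reported by the profiler, that would support the reading that the cost is mostly a profiling artefact.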

Best regards

Thomas