While profiling PyTorch 1.9.0 we noticed how much C++ code and how many third-party libraries it uses. As of 2021 it consists of about 5.6M lines of code, which is quite large.
One setup we used was a single GPU, training a (near) standard ResNet-style model on CIFAR-10. Profiling with NVVP, we observed far too many cudaGetDevice()
calls. During training, cudaGetDevice() was called 6,115,338 times in total, which consumed 5 seconds out of 110 seconds of training time, i.e. about 5%.
There are many places within PyTorch where this function is called. Instead of querying the driver each time, it would be better to track the current GPU in a variable that PyTorch itself controls.
We understand that in complicated software such polling may have its place. Of course, if you are in control of the state, it is better not to poll for it.
In any case, when using a single GPU, all cudaGetDevice() calls are a waste of time and are not needed, because all library contexts, all memory, and all streams should be associated with that single device.
tom
(Thomas V)
December 2, 2021, 9:50am
2
Hi!
With the help of the one and only @ptrblck : this has come up before and it seems that it is mostly a profiling artefact:
opened 12:27AM - 15 Mar 19 UTC
closed 07:46PM - 18 Mar 19 UTC
## 🐛 Bug
While profiling my code, I noticed millions of calls to cudaGetDevice which were slowing down my code significantly (it's an RL use case and the model I am using is small).
I have a minimal repro that demonstrates that a single forward call on nn.Linear() triggers 11 cudaGetDevice() calls and 3 cudaSetDevice() calls.
## To Reproduce
See attached code snippet.
Steps to reproduce the behavior:
/usr/local/cuda/bin/nvprof --profile-from-start off python cudagetdevice.py
Output:
```
$ /usr/local/cuda/bin/nvprof --profile-from-start off python cudagetdevice.py
==112656== NVPROF is profiling process 112656, command: python cudagetdevice.py
==112656== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
==112656== Profiling application: python cudagetdevice.py
==112656== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 73.95% 5.0880us 1 5.0880us 5.0880us 5.0880us sgemm_32x32x32_NT_vec
26.05% 1.7920us 1 1.7920us 1.7920us 1.7920us _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00005244_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN75_GLOBAL__N__51_tmpxft_00005244_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a36CopyOpIffE5applyERNS_6TensorERKS6_EUlRfRKfE_ffjLi1ELi2ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T_
API calls: 99.99% 1.03481s 2 517.41ms 12.000us 1.03480s cudaLaunchKernel
0.01% 67.541us 11 6.1400us 4.5000us 13.500us cudaGetDevice
0.00% 14.583us 3 4.8610us 4.5000us 5.5830us cudaSetDevice
0.00% 9.0410us 2 4.5200us 4.5000us 4.5410us cudaGetLastError
```
## Expected behavior
Number of calls to cudaGetDevice should be minimal; I would argue that it should be called at most once, if that.
## Environment
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Quadro P5000
GPU 1: Quadro P5000
Nvidia driver version: 418.39
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.0.1.post2
[pip] torchfile==0.1.0
[pip] torchvision==0.2.2
[conda] blas 1.0 mkl
[conda] mkl 2019.1 144
[conda] mkl_fft 1.0.6 py36hd81dba3_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.0.1 py3.6_cuda10.0.130_cudnn7.4.2_2 pytorch
[conda] torchfile 0.1.0 <pip>
[conda] torchvision 0.2.2 py_3 pytorch
- PyTorch Version (e.g., 1.0):
1.0.1
- OS (e.g., Linux): Ubuntu 16.04
- How you installed PyTorch (`conda`, `pip`, source):
pip inside a conda env
- Build command you used (if compiling from source):
- Python version:
3.6
- CUDA/cuDNN version:
10.0
- GPU models and configuration:
Quadro P5000
Of course, you could re-run the experiment, replacing all calls to cudaGetDevice/cudaSetDevice with a no-op, and see whether things have changed.
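One way to try that no-op experiment without rebuilding PyTorch is an LD_PRELOAD shim, sketched below under the single-GPU assumption that the current device is always 0. Note this only takes effect if PyTorch links the CUDA runtime dynamically; builds that statically link libcudart would need the change made in the PyTorch sources instead.

```cpp
// noop_shim.cpp -- a hypothetical interposer, not an official tool.
// Build: g++ -shared -fPIC -o libnoopdev.so noop_shim.cpp
// Run:   LD_PRELOAD=./libnoopdev.so python train.py
//
// Replaces the two runtime calls with near-no-ops so their cost
// disappears from the profile. Only valid on a single-GPU setup,
// where device 0 is the answer anyway.
extern "C" int cudaGetDevice(int* device) {
    *device = 0;  // single-GPU assumption: the current device is always 0
    return 0;     // cudaSuccess
}

extern "C" int cudaSetDevice(int /*device*/) {
    return 0;  // ignore: there is only one device to set
}
```

If training time drops measurably with the shim in place, the calls were real overhead; if not, the time attributed to them was mostly a profiling artefact.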
Best regards
Thomas