GPU not fully used, how to optimize the code

jiwidi · June 7, 2020, 1:03pm

Hi!

So Ive been playing with pytorch lately and I end up with a model that seems poorly optimized. The gpu usage is around 30% all the time during training and depending on batch_size the time required to run a epoch can be drastically reduced (if batch_size is high) or long (if batch_size is small).

The full code sits in the following repository https://github.com/jiwidi/las-pytorch.

The training loop is located here and the “juicy” part of the code where training on gpu happens is located here

I also attach here the output of calling my script with torch.utils.bottleneck for 51 minibatches training for both train and validation dataset (a full epoch takes almost a full day, the dataset is 1.000 of Speech2Text)

`bottleneck` is a tool that can be used as an initial step for debugging
bottlenecks in your program.

It summarizes runs of your script with the Python profiler and PyTorch's
autograd profiler. Because your script will be profiled, please ensure that it
exits in a finite amount of time.

For more complicated uses of the profilers, please see
https://docs.python.org/3/library/profile.html and
https://pytorch.org/docs/master/autograd.html#profiler for more information.
Running environment analysis...
Running your script with cProfile
Running your script with the autograd profiler...

--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.5.0 compiled w/ CUDA 10.2
Running with Python 3.6 and 

`pip list` truncated output:
numpy==1.18.1
torch==1.5.0
torchaudio==0.5.0a0+3305d5c
torchvision==0.6.0a0+82fd1c8
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         3139301 function calls (3045612 primitive calls) in 48.917 seconds

   Ordered by: internal time
   List reduced from 2862 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       51   31.576    0.619   31.576    0.619 {method 'run_backward' of 'torch._C._EngineBase' objects}
      183    2.985    0.016    2.985    0.016 {method 'cuda' of 'torch._C._TensorBase' objects}
    11606    2.903    0.000    2.903    0.000 {built-in method lstm}
    11504    1.034    0.000    1.034    0.000 {method 'repeat' of 'torch._C._TensorBase' objects}
    23110    0.907    0.000    0.907    0.000 {built-in method cat}
    11504    0.803    0.000    4.975    0.000 /mnt/04DC6FC8DC6FB31A/github/las-pytorch/model/las_model.py:275(forward)
    23008    0.750    0.000    0.750    0.000 {built-in method addmm}
        3    0.478    0.159    0.478    0.159 {built-in method _cudnn_rnn_flatten_weight}
       51    0.451    0.009    0.451    0.009 {built-in method torch._C._cuda_emptyCache}
      362    0.442    0.001    0.442    0.001 {method 'acquire' of '_thread.lock' objects}
        2    0.402    0.201    0.821    0.411 /mnt/04DC6FC8DC6FB31A/github/las-pytorch/utils/data.py:35(__init__)
    23008    0.384    0.000    0.384    0.000 {built-in method relu}
    11504    0.351    0.000    0.351    0.000 {method 'matmul' of 'torch._C._TensorBase' objects}
    34512    0.342    0.000    1.786    0.000 /home/jiwidi/anaconda3/envs/torch-gpu/lib/python3.6/site-packages/torch/nn/functional.py:1591(linear)
       51    0.304    0.006    9.883    0.194 /mnt/04DC6FC8DC6FB31A/github/las-pytorch/model/las_model.py:186(forward)


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
lstm                     7.66%            26.315ms         7.66%            26.315ms         26.315ms         NaN              0.000us          0.000us          1                []                                   
_cudnn_rnn               7.65%            26.277ms         7.65%            26.277ms         26.277ms         NaN              0.000us          0.000us          1                []                                   
CudnnRnnBackward         7.19%            24.689ms         7.19%            24.689ms         24.689ms         NaN              0.000us          0.000us          1                []                                   
_cudnn_rnn_backward      7.19%            24.672ms         7.19%            24.672ms         24.672ms         NaN              0.000us          0.000us          1                []                                   
lstm                     7.08%            24.296ms         7.08%            24.296ms         24.296ms         NaN              0.000us          0.000us          1                []                                   
_cudnn_rnn               7.07%            24.258ms         7.07%            24.258ms         24.258ms         NaN              0.000us          0.000us          1                []                                   
lstm                     6.69%            22.980ms         6.69%            22.980ms         22.980ms         NaN              0.000us          0.000us          1                []                                   
_cudnn_rnn               6.68%            22.943ms         6.68%            22.943ms         22.943ms         NaN              0.000us          0.000us          1                []                                   
lstm                     6.38%            21.893ms         6.38%            21.893ms         21.893ms         NaN              0.000us          0.000us          1                []                                   
_cudnn_rnn               6.37%            21.854ms         6.37%            21.854ms         21.854ms         NaN              0.000us          0.000us          1                []                                   
lstm                     6.19%            21.240ms         6.19%            21.240ms         21.240ms         NaN              0.000us          0.000us          1                []                                   
_cudnn_rnn               6.18%            21.202ms         6.18%            21.202ms         21.202ms         NaN              0.000us          0.000us          1                []                                   
CudnnRnnBackward         5.92%            20.315ms         5.92%            20.315ms         20.315ms         NaN              0.000us          0.000us          1                []                                   
_cudnn_rnn_backward      5.91%            20.291ms         5.91%            20.291ms         20.291ms         NaN              0.000us          0.000us          1                []                                   
lstm                     5.85%            20.087ms         5.85%            20.087ms         20.087ms         NaN              0.000us          0.000us          1                []                                   
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 343.311ms
CUDA time total: 0.000us

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
CudnnRnnBackward         7.13%            25.220ms         7.13%            25.220ms         25.220ms         7.13%            25.248ms         25.248ms         1                []                                   
_cudnn_rnn_backward      7.12%            25.187ms         7.12%            25.187ms         25.187ms         7.13%            25.228ms         25.228ms         1                []                                   
lstm                     7.05%            24.953ms         7.05%            24.953ms         24.953ms         7.07%            25.008ms         25.008ms         1                []                                   
CudnnRnnBackward         7.01%            24.826ms         7.01%            24.826ms         24.826ms         6.97%            24.672ms         24.672ms         1                []                                   
_cudnn_rnn               7.01%            24.814ms         7.01%            24.814ms         24.814ms         7.03%            24.880ms         24.880ms         1                []                                   
_cudnn_rnn_backward      7.00%            24.792ms         7.00%            24.792ms         24.792ms         6.96%            24.648ms         24.648ms         1                []                                   
lstm                     7.00%            24.772ms         7.00%            24.772ms         24.772ms         7.00%            24.770ms         24.770ms         1                []                                   
_cudnn_rnn               6.98%            24.708ms         6.98%            24.708ms         24.708ms         6.98%            24.709ms         24.709ms         1                []                                   
lstm                     6.47%            22.903ms         6.47%            22.903ms         22.903ms         6.47%            22.901ms         22.901ms         1                []                                   
lstm                     6.46%            22.866ms         6.46%            22.866ms         22.866ms         6.46%            22.864ms         22.864ms         1                []                                   
_cudnn_rnn               6.45%            22.814ms         6.45%            22.814ms         22.814ms         6.45%            22.816ms         22.816ms         1                []                                   
_cudnn_rnn               6.44%            22.798ms         6.44%            22.798ms         22.798ms         6.44%            22.798ms         22.798ms         1                []                                   
lstm                     5.97%            21.120ms         5.97%            21.120ms         21.120ms         5.98%            21.180ms         21.180ms         1                []                                   
lstm                     5.97%            21.114ms         5.97%            21.114ms         21.114ms         5.97%            21.112ms         21.112ms         1                []                                   
_cudnn_rnn               5.95%            21.062ms         5.95%            21.062ms         21.062ms         5.95%            21.064ms         21.064ms         1                []                                   
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 353.950ms
CUDA time total: 353.898ms

By looking at the results seems like it takes the same time to train on cpu mode than with cuda (with a rtx 2060). There must be something wrong…

Im failry new to pytorch so I apreciate any help with help to debug the cpu bottlenecks my code could have.

Thanks

ptrblck · June 8, 2020, 7:43am

You could try to profile the data loading and check if it might be slowing down your code using the ImageNet example.

If the data loading time is not approaching zero, you might want to take a look at this post, which discusses common issues and provides more information.

If the data loading is not an issue, you might need to increase the batch size to increase the workload on the GPU.

Also, setting torch.backends.cudnn.benchmark = True at the beginning of your script might help to select faster kernels, if your input shapes are not highly variable.