Hi!
So I've been playing with PyTorch lately and I've ended up with a model that seems poorly optimized. GPU usage sits at around 30% throughout training, and the time required to run an epoch varies drastically with batch_size: short when batch_size is high, long when it is small.
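Low GPU utilization that improves with larger batches is often a sign the GPU is starved by data loading or host-to-device copies rather than compute. A minimal sketch of the usual DataLoader knobs to check first (the TensorDataset below is a hypothetical stand-in for the real speech dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the real speech dataset (64 samples, 40 features).
ds = TensorDataset(torch.randn(64, 40), torch.randint(0, 10, (64,)))

# num_workers > 0 prepares batches in background worker processes so the GPU
# is not left idle waiting for data; pin_memory=True places batches in
# page-locked host memory, enabling fast asynchronous host-to-device copies.
loader = DataLoader(ds, batch_size=16, num_workers=2, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # non_blocking=True overlaps the copy with GPU compute (needs pinned memory)
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward / backward / optimizer step would go here ...
```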
The full code sits in the following repository https://github.com/jiwidi/las-pytorch.
The training loop is located here, and the “juicy” part of the code where training on the GPU happens is located here.
I also attach the output of running my script under torch.utils.bottleneck for 51 minibatches on both the train and validation datasets (a full epoch takes almost a full day; the dataset is 1.000 of Speech2Text).
`bottleneck` is a tool that can be used as an initial step for debugging
bottlenecks in your program.
It summarizes runs of your script with the Python profiler and PyTorch's
autograd profiler. Because your script will be profiled, please ensure that it
exits in a finite amount of time.
For more complicated uses of the profilers, please see
https://docs.python.org/3/library/profile.html and
https://pytorch.org/docs/master/autograd.html#profiler for more information.
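For reference, bottleneck is invoked as a module wrapper around a script. The tiny script written below is just a placeholder workload so the invocation is self-contained; in my case the argument is the repo's training entry point instead:

```shell
# Minimal demonstration: write a tiny script and profile it with bottleneck.
# Keep the profiled run short (e.g. a few dozen minibatches), since the
# script is executed under both cProfile and the autograd profiler.
cat > /tmp/tiny_train.py <<'EOF'
import torch
x = torch.randn(64, 64)
for _ in range(3):
    x = x @ x.t()
print(x.sum().item())
EOF
python -m torch.utils.bottleneck /tmp/tiny_train.py
```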
Running environment analysis...
Running your script with cProfile
Running your script with the autograd profiler...
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.5.0 compiled w/ CUDA 10.2
Running with Python 3.6 and
`pip list` truncated output:
numpy==1.18.1
torch==1.5.0
torchaudio==0.5.0a0+3305d5c
torchvision==0.6.0a0+82fd1c8
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
3139301 function calls (3045612 primitive calls) in 48.917 seconds
Ordered by: internal time
List reduced from 2862 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
51 31.576 0.619 31.576 0.619 {method 'run_backward' of 'torch._C._EngineBase' objects}
183 2.985 0.016 2.985 0.016 {method 'cuda' of 'torch._C._TensorBase' objects}
11606 2.903 0.000 2.903 0.000 {built-in method lstm}
11504 1.034 0.000 1.034 0.000 {method 'repeat' of 'torch._C._TensorBase' objects}
23110 0.907 0.000 0.907 0.000 {built-in method cat}
11504 0.803 0.000 4.975 0.000 /mnt/04DC6FC8DC6FB31A/github/las-pytorch/model/las_model.py:275(forward)
23008 0.750 0.000 0.750 0.000 {built-in method addmm}
3 0.478 0.159 0.478 0.159 {built-in method _cudnn_rnn_flatten_weight}
51 0.451 0.009 0.451 0.009 {built-in method torch._C._cuda_emptyCache}
362 0.442 0.001 0.442 0.001 {method 'acquire' of '_thread.lock' objects}
2 0.402 0.201 0.821 0.411 /mnt/04DC6FC8DC6FB31A/github/las-pytorch/utils/data.py:35(__init__)
23008 0.384 0.000 0.384 0.000 {built-in method relu}
11504 0.351 0.000 0.351 0.000 {method 'matmul' of 'torch._C._TensorBase' objects}
34512 0.342 0.000 1.786 0.000 /home/jiwidi/anaconda3/envs/torch-gpu/lib/python3.6/site-packages/torch/nn/functional.py:1591(linear)
51 0.304 0.006 9.883 0.194 /mnt/04DC6FC8DC6FB31A/github/las-pytorch/model/las_model.py:186(forward)
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
lstm 7.66% 26.315ms 7.66% 26.315ms 26.315ms NaN 0.000us 0.000us 1 []
_cudnn_rnn 7.65% 26.277ms 7.65% 26.277ms 26.277ms NaN 0.000us 0.000us 1 []
CudnnRnnBackward 7.19% 24.689ms 7.19% 24.689ms 24.689ms NaN 0.000us 0.000us 1 []
_cudnn_rnn_backward 7.19% 24.672ms 7.19% 24.672ms 24.672ms NaN 0.000us 0.000us 1 []
lstm 7.08% 24.296ms 7.08% 24.296ms 24.296ms NaN 0.000us 0.000us 1 []
_cudnn_rnn 7.07% 24.258ms 7.07% 24.258ms 24.258ms NaN 0.000us 0.000us 1 []
lstm 6.69% 22.980ms 6.69% 22.980ms 22.980ms NaN 0.000us 0.000us 1 []
_cudnn_rnn 6.68% 22.943ms 6.68% 22.943ms 22.943ms NaN 0.000us 0.000us 1 []
lstm 6.38% 21.893ms 6.38% 21.893ms 21.893ms NaN 0.000us 0.000us 1 []
_cudnn_rnn 6.37% 21.854ms 6.37% 21.854ms 21.854ms NaN 0.000us 0.000us 1 []
lstm 6.19% 21.240ms 6.19% 21.240ms 21.240ms NaN 0.000us 0.000us 1 []
_cudnn_rnn 6.18% 21.202ms 6.18% 21.202ms 21.202ms NaN 0.000us 0.000us 1 []
CudnnRnnBackward 5.92% 20.315ms 5.92% 20.315ms 20.315ms NaN 0.000us 0.000us 1 []
_cudnn_rnn_backward 5.91% 20.291ms 5.91% 20.291ms 20.291ms NaN 0.000us 0.000us 1 []
lstm 5.85% 20.087ms 5.85% 20.087ms 20.087ms NaN 0.000us 0.000us 1 []
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 343.311ms
CUDA time total: 0.000us
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
CudnnRnnBackward 7.13% 25.220ms 7.13% 25.220ms 25.220ms 7.13% 25.248ms 25.248ms 1 []
_cudnn_rnn_backward 7.12% 25.187ms 7.12% 25.187ms 25.187ms 7.13% 25.228ms 25.228ms 1 []
lstm 7.05% 24.953ms 7.05% 24.953ms 24.953ms 7.07% 25.008ms 25.008ms 1 []
CudnnRnnBackward 7.01% 24.826ms 7.01% 24.826ms 24.826ms 6.97% 24.672ms 24.672ms 1 []
_cudnn_rnn 7.01% 24.814ms 7.01% 24.814ms 24.814ms 7.03% 24.880ms 24.880ms 1 []
_cudnn_rnn_backward 7.00% 24.792ms 7.00% 24.792ms 24.792ms 6.96% 24.648ms 24.648ms 1 []
lstm 7.00% 24.772ms 7.00% 24.772ms 24.772ms 7.00% 24.770ms 24.770ms 1 []
_cudnn_rnn 6.98% 24.708ms 6.98% 24.708ms 24.708ms 6.98% 24.709ms 24.709ms 1 []
lstm 6.47% 22.903ms 6.47% 22.903ms 22.903ms 6.47% 22.901ms 22.901ms 1 []
lstm 6.46% 22.866ms 6.46% 22.866ms 22.866ms 6.46% 22.864ms 22.864ms 1 []
_cudnn_rnn 6.45% 22.814ms 6.45% 22.814ms 22.814ms 6.45% 22.816ms 22.816ms 1 []
_cudnn_rnn 6.44% 22.798ms 6.44% 22.798ms 22.798ms 6.44% 22.798ms 22.798ms 1 []
lstm 5.97% 21.120ms 5.97% 21.120ms 21.120ms 5.98% 21.180ms 21.180ms 1 []
lstm 5.97% 21.114ms 5.97% 21.114ms 21.114ms 5.97% 21.112ms 21.112ms 1 []
_cudnn_rnn 5.95% 21.062ms 5.95% 21.062ms 21.062ms 5.95% 21.064ms 21.064ms 1 []
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 353.950ms
CUDA time total: 353.898ms
Looking at the results, it seems like training in CPU mode takes the same time as with CUDA (on an RTX 2060). There must be something wrong…
I'm fairly new to PyTorch, so I'd appreciate any help debugging the CPU bottlenecks my code might have.
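As a sanity check on CPU-vs-CUDA comparisons: CUDA kernels launch asynchronously, so naive wall-clock timing of GPU code can be misleading. A minimal sketch of event-based GPU timing with a CPU fallback (the matmul is just a placeholder workload, not part of the model):

```python
import time
import torch

def timed_ms(fn, *args):
    """Run fn(*args) once and return (result, elapsed milliseconds)."""
    if torch.cuda.is_available():
        # CUDA events measure time on the GPU stream itself; synchronize()
        # makes sure all queued kernels have actually finished before reading.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        result = fn(*args)
        end.record()
        torch.cuda.synchronize()
        return result, start.elapsed_time(end)
    # On CPU, ordinary wall-clock timing is accurate.
    t0 = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - t0) * 1000.0

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
out, ms = timed_ms(torch.matmul, a, a)
```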
Thanks