How to understand 'torch.utils.bottleneck' result?

I have checked all the topics about `torch.utils.bottleneck`, and none of them clearly explain how to interpret the result. Even the `torch.utils.bottleneck` docs do not help you understand the output.

My result is as follows:

--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.4.0 compiled w/ CUDA 10.1
Running with Python 2.7 and CUDA 10.0.130

`pip list` truncated output:
numpy==1.16.5
numpydoc==0.9.1
torch==1.4.0
torchvision==0.5.0
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         35874438 function calls (35431823 primitive calls) in 281.094 seconds

   Ordered by: internal time
   List reduced from 3311 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    21561   49.227    0.002   49.227    0.002 {method 'argsort' of 'numpy.ndarray' objects}
   215076   36.523    0.000   36.523    0.000 {method 'sort' of 'numpy.ndarray' objects}
        1   17.219   17.219   22.108   22.108 xx_utils.py:150(split_train_test_clusters)
      394   11.803    0.030   11.803    0.030 {method 'run_backward' of 'torch._C._EngineBase' objects}
    11696    8.537    0.001    8.537    0.001 {method 'cuda' of 'torch._C._TensorBase' objects}
       27    7.460    0.276  175.699    6.507 xx.py:109(test)
   825939    7.426    0.000    7.426    0.000 {method 'readline' of 'file' objects}
    21507    6.434    0.000   82.410    0.004 ~/python2.7/site-packages/sklearn/metrics/ranking.py:359(_binary_clf_curve)
  1617818    5.822    0.000    5.822    0.000 {numpy.array}
    43014    5.185    0.000    5.185    0.000 {numpy.where}
        9    5.156    0.573  244.712   27.190 xx.py (train_and_eval)
    76866    4.598    0.000    4.598    0.000 {method 'matmul' of 'torch._C._TensorBase' objects}
   191484    4.341    0.000    4.341    0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
      394    4.252    0.011   16.535    0.042 ~/python2.7/site-packages/torch/optim/adam.py:49(step)
   193569    3.981    0.000   43.295    0.000 ~/python2.7/site-packages/numpy/lib/arraysetops.py:299(_unique1d)


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
item                     7.91%            54.446ms         7.91%            54.446ms         54.446ms         NaN              0.000us          0.000us          1                []                                   
_local_scalar_dense      7.91%            54.444ms         7.91%            54.444ms         54.444ms         NaN              0.000us          0.000us          1                []                                   
item                     7.31%            50.323ms         7.31%            50.323ms         50.323ms         NaN              0.000us          0.000us          1                []                                   
_local_scalar_dense      7.31%            50.321ms         7.31%            50.321ms         50.321ms         NaN              0.000us          0.000us          1                []                                   
item                     6.76%            46.515ms         6.76%            46.515ms         46.515ms         NaN              0.000us          0.000us          1                []                                   
_local_scalar_dense      6.76%            46.512ms         6.76%            46.512ms         46.512ms         NaN              0.000us          0.000us          1                []                                   
item                     6.45%            44.411ms         6.45%            44.411ms         44.411ms         NaN              0.000us          0.000us          1                []                                   
_local_scalar_dense      6.45%            44.408ms         6.45%            44.408ms         44.408ms         NaN              0.000us          0.000us          1                []                                   
item                     6.27%            43.158ms         6.27%            43.158ms         43.158ms         NaN              0.000us          0.000us          1                []                                   
_local_scalar_dense      6.27%            43.156ms         6.27%            43.156ms         43.156ms         NaN              0.000us          0.000us          1                []                                   
to                       6.20%            42.655ms         6.20%            42.655ms         42.655ms         NaN              0.000us          0.000us          1                []                                   
item                     6.12%            42.106ms         6.12%            42.106ms         42.106ms         NaN              0.000us          0.000us          1                []                                   
_local_scalar_dense      6.12%            42.103ms         6.12%            42.103ms         42.103ms         NaN              0.000us          0.000us          1                []                                   
item                     6.09%            41.887ms         6.09%            41.887ms         41.887ms         NaN              0.000us          0.000us          1                []                                   
_local_scalar_dense      6.08%            41.884ms         6.08%            41.884ms         41.884ms         NaN              0.000us          0.000us          1                []                                   
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 688.328ms
CUDA time total: 0.000us

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
to                    8.13%            53.694ms         8.13%            53.694ms         53.694ms         8.36%            53.696ms         53.696ms         1                []                                   
matmul                7.95%            52.501ms         7.95%            52.501ms         52.501ms         8.18%            52.512ms         52.512ms         1                []                                   
view                  7.92%            52.285ms         7.92%            52.285ms         52.285ms         8.15%            52.320ms         52.320ms         1                []                                   
to                    7.29%            48.109ms         7.29%            48.109ms         48.109ms         7.49%            48.112ms         48.112ms         1                []                                   
to                    7.09%            46.830ms         7.09%            46.830ms         46.830ms         7.29%            46.832ms         46.832ms         1                []                                   
to                    6.86%            45.284ms         6.86%            45.284ms         45.284ms         7.05%            45.280ms         45.280ms         1                []                                   
MmBackward            6.57%            43.352ms         6.57%            43.352ms         43.352ms         6.06%            38.912ms         38.912ms         1                []                                   
mm                    6.49%            42.819ms         6.49%            42.819ms         42.819ms         6.05%            38.880ms         38.880ms         1                []                                   
to                    6.32%            41.693ms         6.32%            41.693ms         41.693ms         6.49%            41.680ms         41.680ms         1                []                                   
SigmoidBackward       6.15%            40.570ms         6.15%            40.570ms         40.570ms         5.57%            35.756ms         35.756ms         1                []                                   
sigmoid_backward      6.14%            40.551ms         6.14%            40.551ms         40.551ms         5.57%            35.752ms         35.752ms         1                []                                   
to                    6.01%            39.649ms         6.01%            39.649ms         39.649ms         6.17%            39.648ms         39.648ms         1                []                                   
to                    5.76%            38.014ms         5.76%            38.014ms         38.014ms         5.92%            38.024ms         38.024ms         1                []                                   
to                    5.71%            37.693ms         5.71%            37.693ms         37.693ms         5.87%            37.680ms         37.680ms         1                []                                   
to                    5.61%            37.059ms         5.61%            37.059ms         37.059ms         5.77%            37.056ms         37.056ms         1                []                                   
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 660.104ms
CUDA time total: 642.140ms

Here is what I understood from the result:

From the cProfile output, it looks like I need to reduce the time spent in the numpy `argsort` and `sort` methods in my code.
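For reference, here is the kind of change I am considering for the sort hotspot. It is only a sketch with made-up data, not my actual code: when only the top-k indices of an array are needed, `np.argpartition` avoids the full O(n log n) sort that `argsort` performs.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1_000_000)  # made-up data, stands in for my real arrays
k = 10

# Full sort: O(n log n) -- this is what the `argsort` rows in the profile are doing
top_full = np.argsort(scores)[-k:]

# Partial sort: O(n) on average -- enough when only the top-k entries matter
top_part = np.argpartition(scores, -k)[-k:]

# Same top-k set, just not guaranteed to be in sorted order
assert set(top_full) == set(top_part)
```

(Whether this applies depends on whether the full sorted order is actually needed, e.g. sklearn's `_binary_clf_curve` in the profile does need a sort.)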

From the autograd profiler output (CPU mode), `item` and `_local_scalar_dense` take most of the CPU time.
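As far as I understand, `item()` copies a scalar from the tensor back into Python (and on GPU it forces a device synchronization), so calling it every iteration is costly. A sketch of the pattern I think I should move to, with illustrative names and a dummy loop rather than my real training code:

```python
import torch

# Stand-ins for per-step loss tensors
losses = [torch.tensor(float(i)) for i in range(100)]

# Pattern that shows up as many `item`/`_local_scalar_dense` events:
total_slow = 0.0
for loss in losses:
    total_slow += loss.item()  # one tensor-to-Python round-trip per step

# Accumulate in a tensor instead, and convert to Python once at the end:
total_fast = torch.zeros(())
for loss in losses:
    total_fast += loss.detach()
total_fast = total_fast.item()  # single `item` call

assert total_slow == total_fast
```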

From the autograd profiler output (CUDA mode), `to`, `matmul`, and `view` take the most GPU time.
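The many `to` events (together with the 11696 `cuda` calls in the cProfile output) suggest I am doing a lot of host-to-device copies. One mitigation I have read about, sketched below with an illustrative tensor (not my real data loading), is pinning host memory so the copy can be issued asynchronously with `non_blocking=True`; on a CPU-only machine it falls through to a plain `to`:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 128)  # illustrative batch

# On CUDA, pinned (page-locked) host memory lets `.to(..., non_blocking=True)`
# overlap the copy with computation; on CPU this branch is skipped.
if device.type == "cuda":
    x = x.pin_memory()
x_dev = x.to(device, non_blocking=True)

assert x_dev.device.type == device.type
```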

By the way, in both CPU mode and CUDA mode, the 'Self CPU time total' values are quite similar: 688 ms and 660 ms, respectively. Meanwhile, 'CUDA time total' is 642 ms in CUDA mode. Does that mean anything?

In other words, is there a good ratio or inequality that these times should satisfy?

Thanks in advance.

Leo