Model is 10x slower in PyTorch than in TensorFlow

Hi,

Due to the constraints of TensorFlow, I decided to port my code to PyTorch. In doing so, I chose Ignite as my framework.

The port is nearly finished, but my network runs about 10 times slower in PyTorch and I cannot figure out why.

I have briefly discussed this in this GitHub issue. I also ran profiling with APEX's PyProf and uploaded the output to my Git repo.

The following profiling runs were done:

  • net_independent.sql.gz – run on profiling.py with PyProf's start/stop markers enabled (see the sketch after this list)
  • net_independent.dict.gz – run on profiling.py without the start/stop markers; I also parsed it from SQL into a dict as per PyProf's examples.
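
For reference, the start/stop-marker run follows the usual APEX PyProf recipe: initialize the NVTX annotations, wrap the training loop in torch.autograd.profiler.emit_nvtx(), and delimit the iterations of interest with torch.cuda.profiler.start()/stop() while nvprof captures with --profile-from-start off. A minimal, self-contained sketch (the dummy Conv3d model and tensor shapes below are placeholders, not my actual profiling.py):

```python
import torch
import torch.cuda.profiler as profiler
from apex import pyprof  # APEX-era PyProf; the standalone package is imported as `pyprof`

pyprof.nvtx.init()  # annotate PyTorch ops with NVTX ranges so PyProf can attribute kernels

# Placeholder 3D conv "network" and input, standing in for the real model.
model = torch.nn.Conv3d(4, 8, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(2, 4, 32, 32, 32, device="cuda")

with torch.autograd.profiler.emit_nvtx():
    for step in range(20):
        if step == 10:
            profiler.start()   # nvprof starts capturing here (--profile-from-start off)
        loss = model(x).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step == 15:
            profiler.stop()    # stop capturing after a few steady-state iterations
```

This is then run under nvprof, e.g. nvprof -f -o net_independent.sql --profile-from-start off python profiling.py, and the resulting SQL database is post-processed with PyProf's parse/prof scripts.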

The most useful thing I found was this per-kernel timing summary:

                                                            min      mean           std       max      total   count
kShortName                                                                                                           
cudnn::detail::wgrad_alg1_nd_float_engine           1.622400e-05  0.012991  2.407226e-02  0.086074  70.593751    5434
cudnn::detail::implicit_convolveNd_sgemm            3.795100e-05  0.005770  8.815605e-03  0.033361  23.240455    4028
volta_scudnn_128x64_stridedB_splitK_small_nn_v1     3.494400e-05  0.004681  1.163572e-02  0.046304  21.624084    4620
cudnn::detail::convolveNd_wgrad_engine              1.369600e-05  0.009364  1.378996e-01  2.297072  20.919318    2234
cudnn::detail::convolveNd_dgrad_float_engine        1.257600e-05  0.002649  5.326334e-03  0.046430  14.804468    5588
elementwise_kernel                                  9.280000e-07  0.000041  1.653503e-04  0.001723  11.703452  283920
avg_pool3d_cuda_update_output                       3.328000e-06  0.000666  1.345273e-03  0.005348   3.355386    5040
avg_pool3d_single_backward_out_frame_stride1        3.104000e-06  0.000558  1.115472e-03  0.003855   2.812430    5040
reduce_kernel                                       1.824000e-06  0.000083  1.556771e-04  0.000613   1.691088   20280
setTensor5d_kernel                                  1.248000e-06  0.000229  4.797258e-04  0.001454   1.279915    5588
kernelPointwiseApply3                               1.408000e-06  0.000251  5.185644e-04  0.001856   1.264417    5040
volta_scudnn_128x32_stridedB_splitK_xregs_large_nn  4.360067e-02  0.118861  6.979165e-02  0.176072   0.832024       7
kernelPointwiseApply2                               1.120000e-06  0.000139  3.165819e-04  0.001157   0.816679    5880
cudnn::detail::dgrad_alg1_nd_float_engine           3.263350e-04  0.047277  1.015612e-01  0.385441   0.756425      16
cudnn::gemm::setOutputKernel                        9.280000e-07  0.000131  3.467400e-04  0.001468   0.751137    5720
volta_gcgemm_32x32_tn                               2.160000e-05  0.000025  2.614317e-05  0.002271   0.418831   16426
volta_scudnn_128x128_stridedB_splitK_small_nn_v1    7.689500e-05  0.000220  8.117861e-05  0.000495   0.238825    1088
fft3d_r2c_16x16x16                                  4.224000e-06  0.000006  1.294814e-05  0.001555   0.198598   31358
fft3d_c2r_16x16x16                                  4.896000e-06  0.000006  1.387327e-06  0.000041   0.188108   31322
cudnn::gemm::computeOffsetsKernel                   1.216000e-06  0.000031  8.746352e-05  0.000370   0.174604    5692
sgemm_largek_lds64                                  1.195738e-03  0.001297  3.165635e-05  0.001402   0.155656     120
transpose_readWrite_alignment_kernel                1.503000e-06  0.000002  1.791118e-05  0.003037   0.144902   62682
volta_sgemm_32x32_sliced1x4_nn                      1.225600e-05  0.000269  3.576056e-04  0.000811   0.096964     360
gemv2T_kernel_val                                   6.335000e-06  0.000006  2.950772e-07  0.000016   0.095937   14896
volta_scudnn_128x64_stridedB_splitK_medium_nn_v1    1.886720e-02  0.019069  1.830427e-04  0.019249   0.095347       5
volta_sgemm_128x32_tn                               1.484800e-05  0.000283  2.680396e-04  0.000553   0.067963     240
THCudaTensor_scatterFillKernel                      2.272000e-06  0.000076  1.031451e-04  0.000230   0.027233     360
cudnn::gemm::computeBOffsetsKernel                  9.590000e-07  0.000001  2.185772e-07  0.000002   0.006775    5720
volta_sgemm_32x32_sliced1x4_nt                      1.078400e-05  0.000028  1.699259e-05  0.000057   0.006748     240
volta_sgemm_32x32_sliced1x4_tn                      9.055000e-06  0.000009  2.104289e-07  0.000011   0.001140     120
cudnn::gemm::computeWgradOffsetsKernel              1.696000e-06  0.000015  2.410338e-05  0.000072   0.000415      28
scal_kernel                                         1.504000e-06  0.000002  9.983884e-08  0.000002   0.000191     120

This shows that the kernel hogging most of the time is cudnn::detail::wgrad_alg1_nd_float_engine.
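
In case it helps with reproducing the table above: it is just a per-kernel aggregate over the parsed dict. A sketch of how such a summary can be built with pandas (only the kShortName column comes from the output above; the gzip/pickle loading and the kDuration field are my assumptions about the dict format):

```python
import gzip
import pickle

import pandas as pd

# Assumed layout: a pickled list of per-kernel records parsed from the nvprof SQL.
with gzip.open("net_independent.dict.gz", "rb") as f:
    records = pickle.load(f)

df = pd.DataFrame(records)
df["duration_s"] = df["kDuration"] / 1e9  # assumed field name, nanoseconds -> seconds

summary = (
    df.groupby("kShortName")["duration_s"]
      .agg(["min", "mean", "std", "max", "sum", "count"])
      .rename(columns={"sum": "total"})
      .sort_values("total", ascending=False)
)
print(summary)
```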

The problem is that I do not know what to do next, since this is the first time I have run into this issue.

A co-worker of mine who ran it on a Pascal card said that it worked fine, but on my Volta card it shows these issues.

Please advise me on what to do. Thanks and stay safe!

Double post from here. Solved by setting torch.backends.cudnn.benchmark = True.
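
For anyone else who lands here, this is the one-liner, set once before the first forward pass. Note that benchmark mode pays off when input shapes stay fixed, since cuDNN re-runs its algorithm search whenever it sees a new shape:

```python
import torch

# Let cuDNN time the available convolution algorithms for each input shape
# it encounters and cache the fastest one, instead of relying on its
# heuristic default choice.
torch.backends.cudnn.benchmark = True
```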
