Model is 10x slower in PyTorch than in TensorFlow

Hi,

Due to the constraints of TensorFlow, I decided to port my code to PyTorch. In doing so, I chose Ignite as my framework.

The port is nearly finished, but my network runs about 10 times slower in PyTorch and I cannot figure out why.

I have briefly discussed this in this GitHub issue. I also ran profiling with APEX's PyProf and uploaded the output to my Git repo.

The following profiling runs were done:

  • net_independent.sql.gz – run on profiling.py with PyProf's start/stop markers enabled (see the sketch after this list)
  • net_independent.dict.gz – run on profiling.py without the start/stop markers; I also parsed it from SQL into a dict as per PyProf's examples.
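
For reference, the start/stop-marker run follows the usual APEX PyProf recipe: initialize the NVTX annotations, wrap the training loop in torch.autograd.profiler.emit_nvtx(), and delimit the iterations of interest with torch.cuda.profiler.start()/stop() while nvprof captures with --profile-from-start off. A minimal, self-contained sketch (the dummy Conv3d model and tensor shapes below are placeholders, not my actual profiling.py):

```python
import torch
import torch.cuda.profiler as profiler
from apex import pyprof  # APEX-era PyProf; the standalone package is imported as `pyprof`

pyprof.nvtx.init()  # annotate PyTorch ops with NVTX ranges so PyProf can attribute kernels

# Placeholder 3D conv "network" and input, standing in for the real model.
model = torch.nn.Conv3d(4, 8, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(2, 4, 32, 32, 32, device="cuda")

with torch.autograd.profiler.emit_nvtx():
    for step in range(20):
        if step == 10:
            profiler.start()   # nvprof starts capturing here (--profile-from-start off)
        loss = model(x).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step == 15:
            profiler.stop()    # stop capturing after a few steady-state iterations
```

This is then run under nvprof, e.g. nvprof -f -o net_independent.sql --profile-from-start off python profiling.py, and the resulting SQL database is post-processed with PyProf's parse/prof scripts.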

The most useful thing I found was this per-kernel timing summary:

                                                            min      mean           std       max      total   count
kShortName                                                                                                           
cudnn::detail::wgrad_alg1_nd_float_engine           1.622400e-05  0.012991  2.407226e-02  0.086074  70.593751    5434
cudnn::detail::implicit_convolveNd_sgemm            3.795100e-05  0.005770  8.815605e-03  0.033361  23.240455    4028
volta_scudnn_128x64_stridedB_splitK_small_nn_v1     3.494400e-05  0.004681  1.163572e-02  0.046304  21.624084    4620
cudnn::detail::convolveNd_wgrad_engine              1.369600e-05  0.009364  1.378996e-01  2.297072  20.919318    2234
cudnn::detail::convolveNd_dgrad_float_engine        1.257600e-05  0.002649  5.326334e-03  0.046430  14.804468    5588
elementwise_kernel                                  9.280000e-07  0.000041  1.653503e-04  0.001723  11.703452  283920
avg_pool3d_cuda_update_output                       3.328000e-06  0.000666  1.345273e-03  0.005348   3.355386    5040
avg_pool3d_single_backward_out_frame_stride1        3.104000e-06  0.000558  1.115472e-03  0.003855   2.812430    5040
reduce_kernel                                       1.824000e-06  0.000083  1.556771e-04  0.000613   1.691088   20280
setTensor5d_kernel                                  1.248000e-06  0.000229  4.797258e-04  0.001454   1.279915    5588
kernelPointwiseApply3                               1.408000e-06  0.000251  5.185644e-04  0.001856   1.264417    5040
volta_scudnn_128x32_stridedB_splitK_xregs_large_nn  4.360067e-02  0.118861  6.979165e-02  0.176072   0.832024       7
kernelPointwiseApply2                               1.120000e-06  0.000139  3.165819e-04  0.001157   0.816679    5880
cudnn::detail::dgrad_alg1_nd_float_engine           3.263350e-04  0.047277  1.015612e-01  0.385441   0.756425      16
cudnn::gemm::setOutputKernel                        9.280000e-07  0.000131  3.467400e-04  0.001468   0.751137    5720
volta_gcgemm_32x32_tn                               2.160000e-05  0.000025  2.614317e-05  0.002271   0.418831   16426
volta_scudnn_128x128_stridedB_splitK_small_nn_v1    7.689500e-05  0.000220  8.117861e-05  0.000495   0.238825    1088
fft3d_r2c_16x16x16                                  4.224000e-06  0.000006  1.294814e-05  0.001555   0.198598   31358
fft3d_c2r_16x16x16                                  4.896000e-06  0.000006  1.387327e-06  0.000041   0.188108   31322
cudnn::gemm::computeOffsetsKernel                   1.216000e-06  0.000031  8.746352e-05  0.000370   0.174604    5692
sgemm_largek_lds64                                  1.195738e-03  0.001297  3.165635e-05  0.001402   0.155656     120
transpose_readWrite_alignment_kernel                1.503000e-06  0.000002  1.791118e-05  0.003037   0.144902   62682
volta_sgemm_32x32_sliced1x4_nn                      1.225600e-05  0.000269  3.576056e-04  0.000811   0.096964     360
gemv2T_kernel_val                                   6.335000e-06  0.000006  2.950772e-07  0.000016   0.095937   14896
volta_scudnn_128x64_stridedB_splitK_medium_nn_v1    1.886720e-02  0.019069  1.830427e-04  0.019249   0.095347       5
volta_sgemm_128x32_tn                               1.484800e-05  0.000283  2.680396e-04  0.000553   0.067963     240
THCudaTensor_scatterFillKernel                      2.272000e-06  0.000076  1.031451e-04  0.000230   0.027233     360
cudnn::gemm::computeBOffsetsKernel                  9.590000e-07  0.000001  2.185772e-07  0.000002   0.006775    5720
volta_sgemm_32x32_sliced1x4_nt                      1.078400e-05  0.000028  1.699259e-05  0.000057   0.006748     240
volta_sgemm_32x32_sliced1x4_tn                      9.055000e-06  0.000009  2.104289e-07  0.000011   0.001140     120
cudnn::gemm::computeWgradOffsetsKernel              1.696000e-06  0.000015  2.410338e-05  0.000072   0.000415      28
scal_kernel                                         1.504000e-06  0.000002  9.983884e-08  0.000002   0.000191     120

This shows that the kernel hogging most of the time is cudnn::detail::wgrad_alg1_nd_float_engine.
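
In case it helps with reproducing the table above: it is just a per-kernel aggregate over the parsed dict. A sketch of how such a summary can be built with pandas (only the kShortName column comes from the output above; the gzip/pickle loading and the kDuration field are my assumptions about the dict format):

```python
import gzip
import pickle

import pandas as pd

# Assumed layout: a pickled list of per-kernel records parsed from the nvprof SQL.
with gzip.open("net_independent.dict.gz", "rb") as f:
    records = pickle.load(f)

df = pd.DataFrame(records)
df["duration_s"] = df["kDuration"] / 1e9  # assumed field name, nanoseconds -> seconds

summary = (
    df.groupby("kShortName")["duration_s"]
      .agg(["min", "mean", "std", "max", "sum", "count"])
      .rename(columns={"sum": "total"})
      .sort_values("total", ascending=False)
)
print(summary)
```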

The problem is that I do not know what to do next, since this is the first time I have run into this issue.

A co-worker of mine who ran it on a Pascal card said that it worked fine, but on my Volta card it shows these issues.

Please advise me on what to do. Thanks and stay safe!

Double post from here. Solved by setting torch.backends.cudnn.benchmark = True.
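
For anyone else who lands here, this is the one-liner, set once before the first forward pass. Note that benchmark mode pays off when input shapes stay fixed, since cuDNN re-runs its algorithm search whenever it sees a new shape:

```python
import torch

# Let cuDNN time the available convolution algorithms for each input shape
# it encounters and cache the fastest one, instead of relying on its
# heuristic default choice.
torch.backends.cudnn.benchmark = True
```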
