Hi,
Due to the constraints of TensorFlow I decided to port my code to PyTorch, and I chose Ignite as my framework.
The port is nearly finished, but my network is 10 times slower in PyTorch and I cannot figure out why.
I have briefly discussed it in this GitHub issue, ran profiling with Apex's PyProf, and uploaded the output to my git repo.
The following profiling runs were done (see the instrumentation sketch below):
- net_independent.sql.gz - run on profiling.py with PyProf's start/stop flags enabled
- net_independent.dict.gz - run on profiling.py without PyProf's start/stop flags; I also parsed the SQL output into a dict following PyProf's examples
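For reference, the instrumentation in profiling.py follows the usual PyProf pattern. Here is a minimal sketch of what I mean by the start/stop flags, assuming Apex's bundled PyProf and a toy model in place of my real network (the actual script differs):

```python
import torch
import torch.cuda.profiler as profiler
from apex import pyprof  # with the standalone package this would be `import pyprof`

pyprof.nvtx.init()  # annotate PyTorch ops with NVTX ranges so PyProf can attribute kernels

# toy stand-in for the real network and data
model = torch.nn.Conv3d(4, 8, kernel_size=3, padding=1).cuda()
x = torch.randn(2, 4, 32, 32, 32, device="cuda")

with torch.autograd.profiler.emit_nvtx():
    for step in range(30):
        if step == 10:
            profiler.start()   # "start flag": open the nvprof capture window
        y = model(x)
        y.mean().backward()
        if step == 20:
            profiler.stop()    # "stop flag": close the capture window
```

The script is then run under nvprof with something like `nvprof -f -o net_independent.sql --profile-from-start off python profiling.py`, as in PyProf's examples.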
The most useful thing I found was this per-kernel timing summary:
kShortName                                    min          mean           std          max      total  count
cudnn::detail::wgrad_alg1_nd_float_engine 1.622400e-05 0.012991 2.407226e-02 0.086074 70.593751 5434
cudnn::detail::implicit_convolveNd_sgemm 3.795100e-05 0.005770 8.815605e-03 0.033361 23.240455 4028
volta_scudnn_128x64_stridedB_splitK_small_nn_v1 3.494400e-05 0.004681 1.163572e-02 0.046304 21.624084 4620
cudnn::detail::convolveNd_wgrad_engine 1.369600e-05 0.009364 1.378996e-01 2.297072 20.919318 2234
cudnn::detail::convolveNd_dgrad_float_engine 1.257600e-05 0.002649 5.326334e-03 0.046430 14.804468 5588
elementwise_kernel 9.280000e-07 0.000041 1.653503e-04 0.001723 11.703452 283920
avg_pool3d_cuda_update_output 3.328000e-06 0.000666 1.345273e-03 0.005348 3.355386 5040
avg_pool3d_single_backward_out_frame_stride1 3.104000e-06 0.000558 1.115472e-03 0.003855 2.812430 5040
reduce_kernel 1.824000e-06 0.000083 1.556771e-04 0.000613 1.691088 20280
setTensor5d_kernel 1.248000e-06 0.000229 4.797258e-04 0.001454 1.279915 5588
kernelPointwiseApply3 1.408000e-06 0.000251 5.185644e-04 0.001856 1.264417 5040
volta_scudnn_128x32_stridedB_splitK_xregs_large_nn 4.360067e-02 0.118861 6.979165e-02 0.176072 0.832024 7
kernelPointwiseApply2 1.120000e-06 0.000139 3.165819e-04 0.001157 0.816679 5880
cudnn::detail::dgrad_alg1_nd_float_engine 3.263350e-04 0.047277 1.015612e-01 0.385441 0.756425 16
cudnn::gemm::setOutputKernel 9.280000e-07 0.000131 3.467400e-04 0.001468 0.751137 5720
volta_gcgemm_32x32_tn 2.160000e-05 0.000025 2.614317e-05 0.002271 0.418831 16426
volta_scudnn_128x128_stridedB_splitK_small_nn_v1 7.689500e-05 0.000220 8.117861e-05 0.000495 0.238825 1088
fft3d_r2c_16x16x16 4.224000e-06 0.000006 1.294814e-05 0.001555 0.198598 31358
fft3d_c2r_16x16x16 4.896000e-06 0.000006 1.387327e-06 0.000041 0.188108 31322
cudnn::gemm::computeOffsetsKernel 1.216000e-06 0.000031 8.746352e-05 0.000370 0.174604 5692
sgemm_largek_lds64 1.195738e-03 0.001297 3.165635e-05 0.001402 0.155656 120
transpose_readWrite_alignment_kernel 1.503000e-06 0.000002 1.791118e-05 0.003037 0.144902 62682
volta_sgemm_32x32_sliced1x4_nn 1.225600e-05 0.000269 3.576056e-04 0.000811 0.096964 360
gemv2T_kernel_val 6.335000e-06 0.000006 2.950772e-07 0.000016 0.095937 14896
volta_scudnn_128x64_stridedB_splitK_medium_nn_v1 1.886720e-02 0.019069 1.830427e-04 0.019249 0.095347 5
volta_sgemm_128x32_tn 1.484800e-05 0.000283 2.680396e-04 0.000553 0.067963 240
THCudaTensor_scatterFillKernel 2.272000e-06 0.000076 1.031451e-04 0.000230 0.027233 360
cudnn::gemm::computeBOffsetsKernel 9.590000e-07 0.000001 2.185772e-07 0.000002 0.006775 5720
volta_sgemm_32x32_sliced1x4_nt 1.078400e-05 0.000028 1.699259e-05 0.000057 0.006748 240
volta_sgemm_32x32_sliced1x4_tn 9.055000e-06 0.000009 2.104289e-07 0.000011 0.001140 120
cudnn::gemm::computeWgradOffsetsKernel 1.696000e-06 0.000015 2.410338e-05 0.000072 0.000415 28
scal_kernel 1.504000e-06 0.000002 9.983884e-08 0.000002 0.000191 120
This shows that the kernel hogging most of the time is cudnn::detail::wgrad_alg1_nd_float_engine.
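For context, a summary like the one above can be reproduced from the parsed output with pandas; a rough sketch is below, assuming the dict file is a pickled list of per-kernel records and that the duration field is named kDuration (the exact keys come from PyProf's parser, so adjust as needed):

```python
import gzip, pickle
import pandas as pd

# assumption: net_independent.dict.gz holds pickled per-kernel records
with gzip.open("net_independent.dict.gz", "rb") as f:
    records = pickle.load(f)

df = pd.DataFrame(records)

# aggregate kernel durations per short name, sorted by total time
summary = (
    df.groupby("kShortName")["kDuration"]
      .agg(["min", "mean", "std", "max", "sum", "count"])
      .rename(columns={"sum": "total"})
      .sort_values("total", ascending=False)
)
print(summary)
```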
The problem is that I do not know what to do next, since this is the first time I have faced this issue.
A co-worker of mine who ran the same code on a Pascal card said it worked fine, but on my Volta card it is having issues.
Please advise me on what to do. Thanks and stay safe!