With allow_tf32 set to true:
NVTX Range Statistics:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Style Range
-------- --------------- --------- ------------ ------------ ---------- ---------- ------------ ------- ----------------------
42.7 60,504,152 10 6,050,415.2 1,258,376.0 996,015 49,626,209 15,311,893.5 PushPop backward
36.3 51,315,196 1 51,315,196.0 51,315,196.0 51,315,196 51,315,196 0.0 PushPop iteration0
6.7 9,430,330 10 943,033.0 889,925.0 839,709 1,419,396 170,547.6 PushPop forward
1.7 2,414,154 1 2,414,154.0 2,414,154.0 2,414,154 2,414,154 0.0 PushPop iteration8
1.7 2,390,547 1 2,390,547.0 2,390,547.0 2,390,547 2,390,547 0.0 PushPop iteration7
1.7 2,372,878 1 2,372,878.0 2,372,878.0 2,372,878 2,372,878 0.0 PushPop iteration9
1.7 2,364,427 1 2,364,427.0 2,364,427.0 2,364,427 2,364,427 0.0 PushPop iteration1
1.6 2,276,705 1 2,276,705.0 2,276,705.0 2,276,705 2,276,705 0.0 PushPop iteration2
1.5 2,136,254 1 2,136,254.0 2,136,254.0 2,136,254 2,136,254 0.0 PushPop iteration6
1.5 2,095,326 1 2,095,326.0 2,095,326.0 2,095,326 2,095,326 0.0 PushPop iteration4
1.4 2,025,274 1 2,025,274.0 2,025,274.0 2,025,274 2,025,274 0.0 PushPop iteration3
1.4 2,018,988 1 2,018,988.0 2,018,988.0 2,018,988 2,018,988 0.0 PushPop iteration5
0.1 209,595 1 209,595.0 209,595.0 209,595 209,595 0.0 PushPop cuBLAS:cublasCreate_v2
[4/8] Executing 'osrtsum' stats report
Operating System Runtime API Statistics:
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ------------ ----------------------
58.8 100,167,418 1 100,167,418.0 100,167,418.0 100,167,418 100,167,418 0.0 poll
40.9 69,654,065 20 3,482,703.3 1,205,610.0 238,301 48,951,328 10,705,522.6 pthread_cond_wait
0.2 336,779 45 7,484.0 6,146.0 1,396 45,188 8,683.1 ioctl
0.1 207,222 117 1,771.1 1,466.0 1,396 6,635 815.2 pthread_cond_signal
0.0 81,016 1 81,016.0 81,016.0 81,016 81,016 0.0 pthread_create
0.0 8,661 6 1,443.5 1,466.5 1,397 1,467 36.0 pthread_cond_broadcast
[5/8] Executing 'cudaapisum' stats report
CUDA API Statistics:
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ --------- -------- ---------- ------------ ----------------------------
46.5 39,396,681 3 13,132,227.0 5,169.0 3,841 39,387,671 22,737,881.5 cudaFree
46.5 39,389,556 3 13,129,852.0 2,794.0 1,466 39,385,296 22,737,881.5 cudaFree
2.4 2,050,982 158 12,980.9 11,384.5 7,054 89,677 8,886.0 cudaLaunchKernel
2.0 1,687,170 158 10,678.3 9,149.0 5,098 87,232 8,881.6 cudaLaunchKernel
0.5 459,903 378 1,216.7 1,396.5 907 2,374 251.7 cuGetProcAddress
0.4 344,948 31 11,127.4 10,616.0 7,683 20,673 2,454.4 cudaMemsetAsync
0.3 293,475 1 293,475.0 293,475.0 293,475 293,475 0.0 cuCtxSynchronize
0.3 275,873 31 8,899.1 8,730.0 5,657 18,717 2,443.7 cudaMemsetAsync
0.3 215,670 16 13,479.4 4,574.5 4,191 130,534 31,442.4 cudaStreamCreateWithFlags
0.2 135,075 20 6,753.8 6,740.0 4,680 10,337 1,399.0 cudaEventRecord
0.1 93,587 4 23,396.8 11,244.0 5,727 65,372 28,221.0 cudaMalloc
0.1 88,631 20 4,431.6 4,470.0 2,374 8,451 1,392.4 cudaEventRecord
0.1 84,300 4 21,075.0 9,044.5 2,794 63,417 28,479.9 cudaMalloc
0.1 46,653 20 2,332.7 2,374.0 1,397 3,352 420.9 cudaStreamIsCapturing_v10000
0.0 30,732 19 1,617.5 1,467.0 908 2,794 445.0 cudaEventCreateWithFlags
0.0 20,464 1 20,464.0 20,464.0 20,464 20,464 0.0 cudaHostAlloc
0.0 18,089 1 18,089.0 18,089.0 18,089 18,089 0.0 cudaHostAlloc
0.0 1,885 1 1,885.0 1,885.0 1,885 1,885 0.0 cuInit
0.0 1,467 1 1,467.0 1,467.0 1,467 1,467 0.0 cuMemHostGetDevicePointer_v2
0.0 1,397 1 1,397.0 1,397.0 1,397 1,397 0.0 cuModuleGetLoadingMode
[6/8] Executing 'gpukernsum' stats report
CUDA Kernel Statistics:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
57.8 7,203,562 10 720,356.2 719,492.5 718,356 725,172 2,262.6 sm80_xmma_wgrad_implicit_gemm_indexed_wo_smem_tf32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize32x16x64_st…
11.2 1,400,777 40 35,019.4 26,239.0 2,783 85,503 30,729.7 void cudnn::ops::nchwToNhwcKernel<float, float, float, (bool)0, (bool)1, (cudnnKernelDataType_t)2>(…
7.1 883,827 20 44,191.4 44,176.0 3,168 85,534 41,990.8 void cudnn::ops::nhwcToNchwKernel<float, float, float, (bool)1, (bool)0, (cudnnKernelDataType_t)0>(…
6.8 840,850 10 84,085.0 84,190.0 83,519 84,511 375.7 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::…
5.1 635,734 10 63,573.4 63,439.0 63,295 63,935 265.6 sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_tf32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize128x16x32_s…
4.2 528,695 20 26,434.8 26,352.0 3,680 49,504 23,225.9 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapp…
4.0 493,495 10 49,349.5 49,359.0 49,215 49,439 70.4 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl…
3.2 395,993 10 39,599.3 39,583.5 39,423 39,807 113.1 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::…
0.4 47,294 18 2,627.4 2,624.0 2,559 2,784 70.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, at::deta…
0.2 22,303 10 2,230.3 2,240.0 2,207 2,240 15.6 void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, at::detail::…
[7/8] Executing 'gpumemtimesum' stats report
CUDA Memory Operation Statistics (by time):
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- -------- -------- -------- -------- ----------- -------------
100.0 30,304 31 977.5 928.0 896 2,592 306.1 [CUDA memset]
[8/8] Executing 'gpumemsizesum' stats report
CUDA Memory Operation Statistics (by size):
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- -------------
0.015 31 0.000 0.000 0.000 0.013 0.002 [CUDA memset]
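
In the NVTX range table above, iteration0 takes about 51 ms while iterations 1 through 9 each take roughly 2 ms, so the first iteration presumably includes one-time setup cost and skews any average taken over all ten. As a minimal sketch (not part of the original profiling run), the steady-state time can be estimated by dropping iteration0 from the numbers reported above. The helper name `steady_state_mean_ns` and the choice to treat only iteration0 as warm-up are assumptions made for illustration.

```python
# Per-iteration NVTX range times in nanoseconds, copied from the
# "NVTX Range Statistics" table above (Total Time column, 1 instance each).
iteration_ns = {
    "iteration0": 51_315_196,
    "iteration1": 2_364_427,
    "iteration2": 2_276_705,
    "iteration3": 2_025_274,
    "iteration4": 2_095_326,
    "iteration5": 2_018_988,
    "iteration6": 2_136_254,
    "iteration7": 2_390_547,
    "iteration8": 2_414_154,
    "iteration9": 2_372_878,
}


def steady_state_mean_ns(times, warmup=("iteration0",)):
    """Return the mean range time in ns, skipping the named warm-up ranges.

    Hypothetical helper: treating only iteration0 as warm-up is an assumption.
    """
    kept = [ns for name, ns in times.items() if name not in warmup]
    return sum(kept) / len(kept)


mean_ns = steady_state_mean_ns(iteration_ns)
print(f"steady-state mean: {mean_ns / 1e6:.2f} ms per iteration")
print(f"iteration0 / steady-state: {iteration_ns['iteration0'] / mean_ns:.1f}x")
```

When comparing this run against one with allow_tf32 set to false, applying the same helper to both sets of iteration times keeps the first-iteration overhead from dominating the result.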