Help me understand my profiled function

I’ve designed a filter and profiled it with the PyTorch profiler to see whether I can make it any faster.

I call the filter in this fashion:

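import torch

# Profile one call to the filter, recording CUDA timings alongside CPU timings.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    wiener_filtered = wiener_denoiser(curr_img.to(device), std_curr.to(device) * k)
# Print per-operator averages, sorted by each operator's own CUDA time.
print(prof.key_averages().table(sort_by="self_cuda_time_total"))
prof.export_chrome_trace("trace.json")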

Below is the JSON trace view of the profile:

My first question is about the category shown for some operations: those at the bottom of the ‘tree’ are still labelled ‘cpu_op’. For example, the ifftn function on the far right has the category ‘cpu_op’.

Am I misunderstanding something? I would have assumed these operations were carried out with CUDA.
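
One way to check where the time goes is to tally the exported trace by category. The sketch below assumes trace.json follows the Chrome trace event format (either a bare list of events or an object with a "traceEvents" list) and that timed events carry "cat" and "dur" fields. The helper names summarize_trace and load_events are hypothetical, and the sample events use made-up names and numbers rather than values from my trace.

```python
import json
from collections import defaultdict


def summarize_trace(events):
    """Sum event durations (in microseconds) per trace category."""
    totals = defaultdict(float)
    for ev in events:
        # Skip metadata and any other entries that carry no duration.
        if not isinstance(ev, dict) or "dur" not in ev:
            continue
        totals[ev.get("cat", "unknown")] += ev["dur"]
    return dict(totals)


def load_events(path):
    """Load the event list from a Chrome-trace JSON file."""
    with open(path) as f:
        data = json.load(f)
    # Assumption: the file is either a bare list or {"traceEvents": [...]}.
    return data["traceEvents"] if isinstance(data, dict) else data


# Hand-made sample; category names and durations are illustrative only.
sample = [
    {"name": "aten::fft_ifftn", "cat": "cpu_op", "ph": "X", "ts": 0, "dur": 10},
    {"name": "some_kernel", "cat": "kernel", "ph": "X", "ts": 2, "dur": 5},
    {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "ph": "X", "ts": 1, "dur": 1},
    {"name": "process_name", "ph": "M"},  # metadata event, no duration
]
print(summarize_trace(sample))
```

For the real file, summarize_trace(load_events("trace.json")) would give the same kind of breakdown. If the trace contains GPU kernel events, they should appear under a category separate from ‘cpu_op’.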

My second question is what the cudaFuncGetAttributes function does, and why it takes so long: in the printed profile below it accounts for 805.515ms (32.55% of self CPU time) across 1802 calls.

Thanks.

Below is the printed profile:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                aten::cudnn_convolution         3.05%      75.595ms        29.34%     726.126ms      66.011ms     912.269ms        36.87%     912.269ms      82.934ms            11  
                                   DataParallel.forward         0.99%      24.441ms        99.65%        2.466s        2.466s     839.038ms        33.91%        2.468s        2.468s             1  
                                     aten::masked_fill_         0.00%      35.000us        12.11%     299.720ms     299.720ms     299.730ms        12.11%     299.730ms     299.730ms             1  
                                            aten::copy_         0.06%       1.562ms         0.23%       5.775ms      54.481us     192.428ms         7.78%     192.428ms       1.815ms           106  
                                       aten::leaky_relu         0.01%     298.000us         0.24%       5.838ms     648.667us      49.063ms         1.98%      49.063ms       5.451ms             9  
                                              aten::pow         0.01%     243.000us         1.55%      38.437ms       7.687ms      37.515ms         1.52%      37.535ms       7.507ms             5  
                                         aten::_fft_c2c        17.66%     436.915ms        17.72%     438.542ms     219.271ms      30.159ms         1.22%      35.002ms      17.501ms             2  
                                             aten::sort         0.01%     141.000us         0.17%       4.260ms       4.260ms      20.282ms         0.82%      21.060ms      21.060ms             1  
                                              aten::mul         0.01%     259.000us         0.21%       5.300ms     481.818us      16.536ms         0.67%      16.536ms       1.503ms            11  
                                              aten::sub         0.00%      55.000us         0.39%       9.659ms       4.830ms      11.207ms         0.45%      11.207ms       5.604ms             2  
                                        aten::clamp_min         0.00%      30.000us         0.46%      11.383ms      11.383ms       7.910ms         0.32%       7.910ms       7.910ms             1  
                                            aten::ceil_         0.00%      49.000us         0.28%       6.889ms       6.889ms       6.898ms         0.28%       6.898ms       6.898ms             1  
                                              aten::any         0.00%      42.000us         0.27%       6.731ms       6.731ms       6.415ms         0.26%       6.415ms       6.415ms             1  
                                              aten::div         0.00%     101.000us         0.24%       5.932ms       1.483ms       6.342ms         0.26%       6.437ms       1.609ms             4  
                                             aten::mean         0.00%      70.000us         0.23%       5.771ms       2.885ms       5.915ms         0.24%       5.915ms       2.958ms             2  
                                              aten::cat         0.00%      59.000us         0.23%       5.606ms       5.606ms       5.613ms         0.23%       5.613ms       5.613ms             1  
                                              aten::abs         0.21%       5.278ms         0.44%      10.957ms       5.479ms       5.421ms         0.22%      13.120ms       6.560ms             2  
                                              aten::add         0.00%     118.000us         0.01%     203.000us      33.833us       4.611ms         0.19%       4.611ms     768.500us             6  
                                           aten::gather         0.00%      74.000us         0.15%       3.609ms       3.609ms       3.608ms         0.15%       3.617ms       3.617ms             1  
                                           aten::col2im         0.00%      51.000us         0.44%      10.949ms       5.474ms       2.587ms         0.10%       2.600ms       1.300ms             2  
                                              aten::exp         0.08%       2.064ms         0.08%       2.064ms       1.032ms       2.083ms         0.08%       2.083ms       1.042ms             2  
                                            aten::fill_         0.00%      18.000us         0.10%       2.554ms       2.554ms       1.867ms         0.08%       1.867ms       1.867ms             1  
                                 aten::reflection_pad3d         0.00%      85.000us         0.06%       1.470ms     735.000us       1.520ms         0.06%       1.611ms     805.500us             2  
                                            aten::slice         0.09%       2.208ms         0.09%       2.308ms      11.259us     806.000us         0.03%       1.065ms       5.195us           205  
                                    aten::_pad_circular         0.12%       2.927ms         0.31%       7.764ms     705.818us     768.000us         0.03%     175.574ms      15.961ms            11  
                                               aten::ne         0.00%      41.000us         0.68%      16.802ms      16.802ms     517.000us         0.02%     517.000us     517.000us             1  
                                       aten::as_strided         0.01%     167.000us         0.01%     167.000us       0.542us     494.000us         0.02%     494.000us       1.604us           308  
                                           aten::repeat         0.03%     812.000us         0.09%       2.211ms     245.667us     410.000us         0.02%       3.301ms     366.778us             9  
                                            aten::empty         0.02%     410.000us         0.07%       1.660ms      36.889us     387.000us         0.02%     387.000us       8.600us            45  
                                           aten::unfold         0.02%     435.000us         0.02%     456.000us       8.604us     313.000us         0.01%     437.000us       8.245us            53  
                                         aten::quantile         0.01%     248.000us        13.85%     342.607ms     342.607ms     157.000us         0.01%     341.885ms     341.885ms             1  
                                                Scatter         0.01%     140.000us         0.01%     240.000us     120.000us     139.000us         0.01%     246.000us     123.000us             2  
                                         aten::_to_copy         0.01%     179.000us         0.15%       3.590ms     512.857us     135.000us         0.01%       4.656ms     665.143us             7  
                                        aten::unsqueeze         0.01%     170.000us         0.01%     180.000us      11.250us     119.000us         0.00%     161.000us      10.062us            16  
                                           aten::expand         0.01%     168.000us         0.01%     177.000us       9.316us     113.000us         0.00%     159.000us       8.368us            19  
                                               aten::to         0.00%      71.000us         0.15%       3.661ms     281.615us      96.000us         0.00%       4.752ms     365.538us            13  
                                           aten::arange         0.00%      92.000us         0.01%     158.000us      39.500us      86.000us         0.00%     173.000us      43.250us             4  
                                    aten::empty_strided         0.00%      88.000us         0.00%      88.000us      11.000us      84.000us         0.00%      84.000us      10.500us             8  
                                            aten::clone         0.01%     215.000us         0.06%       1.481ms     148.100us      71.000us         0.00%      10.436ms       1.044ms            10  
                            aten::flatten_dense_tensors         0.00%     103.000us         0.23%       5.724ms       5.724ms      70.000us         0.00%       5.728ms       5.728ms             1  
                                             aten::view         0.00%      64.000us         0.00%      64.000us       2.783us      58.000us         0.00%      58.000us       2.522us            23  
                                       aten::empty_like         0.03%     668.000us         0.04%       1.113ms      85.615us      58.000us         0.00%      81.000us       6.231us            13  
                                              aten::pad         0.01%     145.000us         0.38%       9.379ms     721.462us      57.000us         0.00%     177.242ms      13.634ms            13  
                                        aten::expand_as         0.00%      66.000us         0.01%     141.000us      15.667us      55.000us         0.00%     132.000us      14.667us             9  
                                     aten::_convolution         0.01%     283.000us        29.36%     726.409ms      66.037ms      46.000us         0.00%     912.315ms      82.938ms            11  
                                           aten::conv3d         0.01%     134.000us        29.37%     726.714ms      66.065ms      43.000us         0.00%     912.401ms      82.946ms            11  
                                      aten::convolution         0.01%     171.000us        29.36%     726.580ms      66.053ms      43.000us         0.00%     912.358ms      82.942ms            11  
                                          aten::permute         0.01%     139.000us         0.01%     158.000us      15.800us      42.000us         0.00%      52.000us       5.200us            10  
                                        aten::new_empty         0.01%     126.000us         0.03%     630.000us      57.273us      42.000us         0.00%      55.000us       5.000us            11  
                                       aten::contiguous         0.00%      57.000us         0.04%       1.049ms     149.857us      31.000us         0.00%       7.287ms       1.041ms             7  
                                            aten::split         0.00%      29.000us         0.00%      81.000us      40.500us      28.000us         0.00%      86.000us      43.000us             2  
                                           aten::narrow         0.00%      27.000us         0.00%      52.000us      26.000us      27.000us         0.00%      58.000us      29.000us             2  
                                         aten::squeeze_         0.00%      24.000us         0.00%      33.000us      33.000us      24.000us         0.00%      36.000us      36.000us             1  
                                              aten::neg         0.00%      17.000us         0.00%      17.000us       8.500us      23.000us         0.00%      23.000us      11.500us             2  
                                            aten::alias         0.00%       7.000us         0.00%       7.000us       0.778us      23.000us         0.00%      23.000us       2.556us             9  
                                          aten::reshape         0.00%     107.000us         0.02%     550.000us     137.500us      22.000us         0.00%       3.190ms     797.500us             4  
                                            aten::chunk         0.00%      19.000us         0.00%     100.000us      50.000us      21.000us         0.00%     107.000us      53.500us             2  
                                          aten::resize_         0.00%      36.000us         0.00%      36.000us       4.000us      20.000us         0.00%      20.000us       2.222us             9  
                                      aten::as_strided_         0.00%      41.000us         0.00%      41.000us       6.833us      18.000us         0.00%      18.000us       3.000us             6  
                                      aten::masked_fill         0.00%      49.000us        12.12%     299.824ms     299.824ms      17.000us         0.00%     299.763ms     299.763ms             1  
                                             aten::real         0.00%      72.000us         0.01%     158.000us      79.000us      14.000us         0.00%      27.000us      13.500us             2  
                                      aten::result_type         0.00%       3.000us         0.00%       3.000us       0.600us      11.000us         0.00%      11.000us       2.200us             5  
                                    aten::scalar_tensor         0.00%      24.000us         0.10%       2.585ms       2.585ms      10.000us         0.00%       1.878ms       1.878ms             1  
                                                 Gather         0.00%      79.000us         0.00%     113.000us     113.000us       9.000us         0.00%      56.000us      56.000us             1  
                                           aten::select         0.00%      42.000us         0.00%      47.000us      23.500us       8.000us         0.00%      11.000us       5.500us             2  
                                         aten::fft_fftn         0.00%      47.000us         1.05%      25.958ms      25.958ms       6.000us         0.00%      25.671ms      25.671ms             1  
                                            aten::isnan         0.00%      29.000us         0.68%      16.831ms      16.831ms       4.000us         0.00%     521.000us     521.000us             1  
                                aten::broadcast_tensors         0.00%       9.000us         0.00%      29.000us      29.000us       4.000us         0.00%       9.000us       9.000us             1  
                                 aten::split_with_sizes         0.00%      12.000us         0.00%      12.000us      12.000us       4.000us         0.00%       5.000us       5.000us             1  
                                             aten::relu         0.00%      26.000us         0.46%      11.409ms      11.409ms       3.000us         0.00%       7.913ms       7.913ms             1  
                                        aten::fft_ifftn         0.00%      17.000us        16.68%     412.760ms     412.760ms       3.000us         0.00%      10.495ms      10.495ms             1  
                                             aten::mul_         0.00%      20.000us         0.00%      78.000us      78.000us       3.000us         0.00%       1.525ms       1.525ms             1  
                                     aten::_unsafe_view         0.00%       5.000us         0.00%       5.000us       2.500us       2.000us         0.00%       2.000us       1.000us             2  
                                   aten::_reshape_alias         0.00%       4.000us         0.00%       4.000us       2.000us       2.000us         0.00%       2.000us       1.000us             2  
                                     aten::view_as_real         0.00%      39.000us         0.00%      39.000us      19.500us       2.000us         0.00%       2.000us       1.000us             2  
                                        cudaMemcpyAsync         0.08%       1.879ms         0.08%       1.879ms      27.232us       0.000us         0.00%       0.000us       0.000us            69  
                                  cudaStreamSynchronize         0.01%     241.000us         0.01%     241.000us       6.514us       0.000us         0.00%       0.000us       0.000us            37  
                                  cudaStreamIsCapturing         0.00%      19.000us         0.00%      19.000us       0.633us       0.000us         0.00%       0.000us       0.000us            30  
                                             cudaMalloc         6.08%     150.534ms         6.08%     150.534ms       1.244ms       0.000us         0.00%       0.000us       0.000us           121  
                                       cudaLaunchKernel        18.64%     461.179ms        18.64%     461.179ms       1.270ms       0.000us         0.00%       0.000us       0.000us           363  
                                   cudaDriverGetVersion         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us             3  
                                     cudaGetDeviceCount         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us             3  
                                               cudaFree        17.91%     443.161ms        17.91%     443.161ms      34.089ms       0.000us         0.00%       0.000us       0.000us            13  
                    cudaThreadExchangeStreamCaptureMode         0.00%      52.000us         0.00%      52.000us       0.157us       0.000us         0.00%       0.000us       0.000us           332  
                                          cudaHostAlloc         0.45%      11.012ms         0.45%      11.012ms       2.753ms       0.000us         0.00%       0.000us       0.000us             4  
                                 cudaDeviceGetAttribute         0.01%     134.000us         0.01%     134.000us       3.190us       0.000us         0.00%       0.000us       0.000us            42  
                                  cudaFuncGetAttributes        32.55%     805.515ms        32.55%     805.515ms     447.012us       0.000us         0.00%       0.000us       0.000us          1802  
                                   cudaFuncSetAttribute         0.00%      16.000us         0.00%      16.000us       0.009us       0.000us         0.00%       0.000us       0.000us          1801  
                                  cudaDeviceGetPCIBusId         0.00%       2.000us         0.00%       2.000us       2.000us       0.000us         0.00%       0.000us       0.000us             1  
                              cudaStreamCreateWithFlags         0.08%       1.867ms         0.08%       1.867ms      98.263us       0.000us         0.00%       0.000us       0.000us            19  
                                        cudaMemsetAsync         0.01%     181.000us         0.01%     181.000us       1.828us       0.000us         0.00%       0.000us       0.000us            99  
                                    cudaStreamWaitEvent         0.00%       4.000us         0.00%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36  
                                      cudaStreamDestroy         0.00%      11.000us         0.00%      11.000us      11.000us       0.000us         0.00%       0.000us       0.000us             1  
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.00%      98.000us         0.00%      98.000us      24.500us       0.000us         0.00%       0.000us       0.000us             4  
                                    cudaPeekAtLastError         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us            12  
                                         cuLaunchKernel         0.00%      85.000us         0.00%      85.000us       9.444us       0.000us         0.00%       0.000us       0.000us             9  
                             cudaGetDeviceProperties_v2         0.01%     131.000us         0.01%     131.000us     131.000us       0.000us         0.00%       0.000us       0.000us             1  
                               cudaHostGetDevicePointer         0.00%       1.000us         0.00%       1.000us       1.000us       0.000us         0.00%       0.000us       0.000us             1  
                                   cudaGetSymbolAddress         0.00%      79.000us         0.00%      79.000us      79.000us       0.000us         0.00%       0.000us       0.000us             1  
                                  cudaStreamGetPriority         0.00%       5.000us         0.00%       5.000us       0.455us       0.000us         0.00%       0.000us       0.000us            11  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.474s
Self CUDA time total: 2.475s

OK, I regenerated the plot with the stack flag set to true and have the updated figure:

This answers Q2 in part. I trained the model with DataParallel, so I guess I should convert it to a conventional model before running inference on a single GPU to avoid this overhead.
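
For the conversion, a minimal sketch of one common approach: a checkpoint saved from a DataParallel-wrapped model has every parameter key prefixed with "module.", so the prefix can be stripped before loading the weights into a plain model. The helper name strip_data_parallel_prefix is hypothetical, and the keys and values below are stand-ins rather than my actual checkpoint.

```python
from collections import OrderedDict


def strip_data_parallel_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only strip the prefix when it is actually at the start of the key.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for a checkpoint saved from a DataParallel-wrapped model;
# the layer names and values are made up for illustration.
saved = OrderedDict([("module.conv1.weight", 1), ("module.conv1.bias", 2)])
print(list(strip_data_parallel_prefix(saved)))
```

The cleaned dictionary can then be passed to load_state_dict on an unwrapped model. If the wrapped model is already in memory, its .module attribute gives the underlying network directly.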