About elementwise_kernel

In the profiler output, I see that unrolled_elementwise_kernel takes some GPU time (it is the kernel with the second-highest GPU time). However, when I looked at the source code, this kernel simply calls another kernel:

__global__ void unrolled_elementwise_kernel(int N, func_t f, array_t data,
                                            inp_calc_t ic, out_calc_t oc, loader_t l, storer_t s)
{
  // Number of elements left for this block to process.
  int remaining = N - block_work_size * blockIdx.x;
  auto policy = memory::policies::unroll<array_t, inp_calc_t, out_calc_t, loader_t, storer_t>(data, remaining, ic, oc, l, s);
  elementwise_kernel_helper(f, policy);
}

So I wonder why it is shown in the profiler output. What can be understood from that?

Could you post the profiling output you created, as well as your current setup (CUDA and PyTorch versions, GPU used, etc.)?

The profiler shows how much time each kernel takes, and can thus point to the most expensive calls and bottlenecks in your code.