Confused about pytorch.profiler's output

I use nvidia nemo trained a bert model with NER task, then I want to mesure the forward() inference time on GPU.
First I use the below code, and I get the results, the forward() time almost range from 10ms to 300ms.

torch.cuda.synchronize()
start = time.perf_counter()
logits = self.forward(
    input_ids=input_ids.to(device),
    token_type_ids=input_type_ids.to(device),
    attention_mask=input_mask.to(device),
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f'forward elapsed:{elapsed:.6f}', flush=True)

# the output
forward elapsed:0.015405
forward elapsed:0.114259
forward elapsed:0.279090
forward elapsed:0.067512
forward elapsed:0.051593
forward elapsed:0.259060
forward elapsed:0.059500
forward elapsed:0.295235

then I want to dig out why the results so inconsistent, I use the torch.profiler to get the detailed time consume, but the CPU time reached 2.x second, huge different from former results.

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        logits = self.forward(
            input_ids=input_ids.to(device),
            token_type_ids=input_type_ids.to(device),
            attention_mask=input_mask.to(device),
        )
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20), flush=True)

# output
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                        model_inference         1.24%      35.972ms        99.81%        2.895s        2.895s       0.000us         0.00%       1.846ms       1.846ms             1
                                               aten::to         0.03%     959.000us        92.87%        2.693s     673.317ms       0.000us         0.00%      11.000us       2.750us             4
                                            aten::copy_         0.09%       2.735ms        92.83%        2.692s     168.263ms      63.000us         3.41%      63.000us       3.938us            16
                                       cudaLaunchKernel        92.82%        2.692s        92.82%        2.692s       7.084ms       0.000us         0.00%       0.000us       0.000us           380
                                           aten::matmul         0.15%       4.414ms         3.25%      94.297ms     962.214us       0.000us         0.00%       1.034ms      10.551us            98
                                           aten::linear         0.06%       1.840ms         3.08%      89.196ms       1.189ms       0.000us         0.00%       1.264ms      16.853us            75
                                               aten::mm         2.41%      69.938ms         2.56%      74.208ms       1.003ms     928.000us        50.27%     928.000us      12.541us            74
                                              aten::bmm         0.43%      12.488ms         0.44%      12.860ms     535.833us     106.000us         5.74%     106.000us       4.417us            24
                                        aten::embedding         0.15%       4.314ms         0.37%      10.781ms       3.594ms       0.000us         0.00%      39.000us      13.000us             3
                                       aten::layer_norm         0.03%       1.009ms         0.37%      10.671ms     426.840us       0.000us         0.00%     170.000us       6.800us            25
                                aten::native_layer_norm         0.29%       8.490ms         0.33%       9.662ms     386.480us     170.000us         9.21%     170.000us       6.800us            25
                                             aten::rsub         0.08%       2.330ms         0.28%       8.233ms       8.233ms       0.000us         0.00%       2.000us       2.000us             1
                                        aten::clamp_min         0.14%       4.108ms         0.24%       7.000ms       3.500ms       2.000us         0.11%       4.000us       2.000us             2
                                              aten::sub         0.20%       5.881ms         0.20%       5.903ms       5.903ms       2.000us         0.11%       2.000us       2.000us             1
                                     aten::index_select         0.18%       5.212ms         0.20%       5.843ms       1.948ms      39.000us         2.11%      39.000us      13.000us             3
                                             aten::relu         0.05%       1.367ms         0.19%       5.505ms       5.505ms       0.000us         0.00%       2.000us       2.000us             1
                                            aten::zeros         0.18%       5.134ms         0.19%       5.470ms       5.470ms       0.000us         0.00%       0.000us       0.000us             1
                                          aten::softmax         0.02%     680.000us         0.12%       3.605ms     300.417us       0.000us         0.00%      33.000us       2.750us            12
                                              aten::mul         0.12%       3.470ms         0.12%       3.485ms       3.485ms       2.000us         0.11%       2.000us       2.000us             1
                                            aten::addmm         0.12%       3.403ms         0.12%       3.465ms       3.465ms      13.000us         0.70%      13.000us      13.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.900s
Self CUDA time total: 1.846ms

My question:

  1. Why torch.profiler echo a very big CPU time ?
  2. What is cudaLaunchKernel, Why it occupied 92.82% CPU time?