I trained a BERT model for an NER task with NVIDIA NeMo, and now I want to measure the forward() inference time on the GPU.
First I used the code below; the measured forward() time ranges roughly from 10 ms to 300 ms.
torch.cuda.synchronize()
start = time.perf_counter()
logits = self.forward(
    input_ids=input_ids.to(device),
    token_type_ids=input_type_ids.to(device),
    attention_mask=input_mask.to(device),
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f'forward elapsed:{elapsed:.6f}', flush=True)
# the output
forward elapsed:0.015405
forward elapsed:0.114259
forward elapsed:0.279090
forward elapsed:0.067512
forward elapsed:0.051593
forward elapsed:0.259060
forward elapsed:0.059500
forward elapsed:0.295235
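(As an aside, one common source of this kind of variance is timing without warm-up iterations: the first few forward passes pay one-time costs such as CUDA context initialization, kernel caching, and memory-pool growth. A minimal, framework-agnostic timing harness along these lines is sketched below; `run_once` is a hypothetical stand-in for the synchronized forward pass.)

```python
import time
import statistics

def time_callable(run_once, warmup=10, iters=50):
    """Time `run_once` after discarding `warmup` calls; return stats in ms.

    For a GPU model, `run_once` should wrap the forward pass and end with
    torch.cuda.synchronize() so the wall-clock span covers the kernels too.
    """
    for _ in range(warmup):
        run_once()                       # absorb one-time costs (init, caching)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)  # seconds -> ms
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }

# Usage with a stand-in CPU workload in place of the real forward pass:
stats = time_callable(lambda: sum(range(10_000)))
print(stats)
```

Reporting the median over many post-warm-up iterations is usually more stable than a single wall-clock sample.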
Then I wanted to dig into why the results are so inconsistent, so I used torch.profiler to get a detailed breakdown of where the time goes, but the reported CPU time reached about 2.9 seconds, hugely different from the earlier results.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        logits = self.forward(
            input_ids=input_ids.to(device),
            token_type_ids=input_type_ids.to(device),
            attention_mask=input_mask.to(device),
        )
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20), flush=True)
# output
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
model_inference 1.24% 35.972ms 99.81% 2.895s 2.895s 0.000us 0.00% 1.846ms 1.846ms 1
aten::to 0.03% 959.000us 92.87% 2.693s 673.317ms 0.000us 0.00% 11.000us 2.750us 4
aten::copy_ 0.09% 2.735ms 92.83% 2.692s 168.263ms 63.000us 3.41% 63.000us 3.938us 16
cudaLaunchKernel 92.82% 2.692s 92.82% 2.692s 7.084ms 0.000us 0.00% 0.000us 0.000us 380
aten::matmul 0.15% 4.414ms 3.25% 94.297ms 962.214us 0.000us 0.00% 1.034ms 10.551us 98
aten::linear 0.06% 1.840ms 3.08% 89.196ms 1.189ms 0.000us 0.00% 1.264ms 16.853us 75
aten::mm 2.41% 69.938ms 2.56% 74.208ms 1.003ms 928.000us 50.27% 928.000us 12.541us 74
aten::bmm 0.43% 12.488ms 0.44% 12.860ms 535.833us 106.000us 5.74% 106.000us 4.417us 24
aten::embedding 0.15% 4.314ms 0.37% 10.781ms 3.594ms 0.000us 0.00% 39.000us 13.000us 3
aten::layer_norm 0.03% 1.009ms 0.37% 10.671ms 426.840us 0.000us 0.00% 170.000us 6.800us 25
aten::native_layer_norm 0.29% 8.490ms 0.33% 9.662ms 386.480us 170.000us 9.21% 170.000us 6.800us 25
aten::rsub 0.08% 2.330ms 0.28% 8.233ms 8.233ms 0.000us 0.00% 2.000us 2.000us 1
aten::clamp_min 0.14% 4.108ms 0.24% 7.000ms 3.500ms 2.000us 0.11% 4.000us 2.000us 2
aten::sub 0.20% 5.881ms 0.20% 5.903ms 5.903ms 2.000us 0.11% 2.000us 2.000us 1
aten::index_select 0.18% 5.212ms 0.20% 5.843ms 1.948ms 39.000us 2.11% 39.000us 13.000us 3
aten::relu 0.05% 1.367ms 0.19% 5.505ms 5.505ms 0.000us 0.00% 2.000us 2.000us 1
aten::zeros 0.18% 5.134ms 0.19% 5.470ms 5.470ms 0.000us 0.00% 0.000us 0.000us 1
aten::softmax 0.02% 680.000us 0.12% 3.605ms 300.417us 0.000us 0.00% 33.000us 2.750us 12
aten::mul 0.12% 3.470ms 0.12% 3.485ms 3.485ms 2.000us 0.11% 2.000us 2.000us 1
aten::addmm 0.12% 3.403ms 0.12% 3.465ms 3.465ms 13.000us 0.70% 13.000us 13.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 2.900s
Self CUDA time total: 1.846ms
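(For comparison, torch.profiler can also be given a schedule so that warm-up steps are excluded from the recorded table and one-time initialization costs do not dominate it. Below is a self-contained sketch with a small `nn.Linear` standing in for the BERT model; on a GPU you would add `ProfilerActivity.CUDA` to `activities` and profile the real forward pass instead.)

```python
import torch
from torch import nn
from torch.profiler import profile, schedule, ProfilerActivity

model = nn.Linear(16, 4)   # hypothetical stand-in for the BERT model
x = torch.randn(8, 16)     # hypothetical stand-in for the batch

# Skip 1 step, treat 2 as warm-up, record the next 3 steady-state steps.
sched = schedule(wait=1, warmup=2, active=3)
with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    for _ in range(6):
        _ = model(x)
        prof.step()        # advance the profiler schedule each iteration

table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```

With a schedule like this, the averages in the table come only from the steady-state iterations.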
My questions:
- Why does torch.profiler report such a large CPU time?
- What is cudaLaunchKernel, and why does it occupy 92.82% of the CPU time?