Since there’s a 1-second gap even between two batch norm kernels, the total execution time certainly shouldn’t be just a few milliseconds. Why is this happening? Am I misunderstanding the data captured by nsys? Or is there an issue with my measurement method? Or does this measurement method only count the sum of kernel execution times without including idle periods?
Additionally, could I use the time elapsed between CUDA profiling initialization and the CUDA profiling data flush as my total time (including the time for host-to-device and device-to-host data transfers)?
Your events measure the kernel execution time on the GPU, not host overhead, idle times, CPU bottlenecks, etc. The Nsight Systems view shows you the actual timeline with kernel launches, kernel execution times, etc.
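For illustration, here is a minimal sketch (with a stand-in matmul workload, not your model) contrasting what CUDA events report with a host-side wall-clock measurement that also captures launch overhead and CPU-side gaps:

```python
import time
import torch

# Hypothetical workload standing in for whatever op is being profiled.
x = torch.randn(1024, 1024, device="cuda")

# 1) CUDA events: measure the elapsed time between the two events on the GPU stream.
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
y = x @ x
end_evt.record()
torch.cuda.synchronize()  # wait until both events have been recorded
print("CUDA event time (ms):", start_evt.elapsed_time(end_evt))

# 2) Host wall-clock: also includes kernel-launch overhead, CPU work,
#    and any periods where the GPU sits idle waiting for the CPU.
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()  # block until the GPU has finished all queued work
print("Wall-clock time (ms):", (time.perf_counter() - t0) * 1e3)
```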
Thank you very much for your response!! I think I understand what you mean.
I’d like to ask some more detailed questions:
If I want to get the actual execution time (including kernel launch and idle waiting times in between), can I get that directly through instrumentation in PyTorch?
Or can I get it directly from nsys? If so, I'm not sure which point in the nsys timeline I should use as the start of timing and which point as the end.
I would recommend sticking with a visual profiler to get a full overview of the execution timeline. Nsight Systems or the native PyTorch profiler would be some options.
You can use NVTX markers inside your code to mark regions, which would then show up in nsys.
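As a minimal sketch (the conv layer and shapes are just placeholders), NVTX ranges can be pushed and popped around the regions you care about:

```python
import torch

# Hypothetical example: wrap regions of interest in NVTX ranges so they
# appear as named blocks on the nsys timeline.
x = torch.randn(64, 3, 224, 224, device="cuda")
conv = torch.nn.Conv2d(3, 64, 3, padding=1).cuda()

torch.cuda.nvtx.range_push("forward")
y = conv(x)
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("reduce")
loss = y.sum()
torch.cuda.nvtx.range_pop()
```

Running the script under nsys (e.g. `nsys profile python script.py`) should then show these named ranges on the NVTX row of the timeline, so you can read the start and end of the whole region directly from the report.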
Why is the idle time between kernels so long? The kernel execution time is only a few milliseconds, but the waiting time is over a second, which is a completely different order of magnitude.
Is this normal behavior without applying other optimizations? What causes this situation?
Or is it because the model I’m executing or the input data is relatively small, resulting in this phenomenon?
Your workload might be CPU-limited, meaning your CPU is not fast enough at scheduling the kernels. You could try applying CUDA Graphs to reduce the CPU overhead and profile it again.
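In case it helps, here is a minimal, hypothetical sketch of capturing and replaying a forward pass with CUDA Graphs (the model and shapes are placeholders, and inference-only capture is assumed), not a drop-in for your training setup:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda().eval()
static_input = torch.randn(64, 1024, device="cuda")

# Warm up on a side stream before capture, as recommended for graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass into a graph using static input/output tensors.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the static input tensor, then launch the entire
# captured kernel sequence with a single CPU-side call, avoiding per-kernel
# launch overhead.
new_batch = torch.randn(64, 1024, device="cuda")
static_input.copy_(new_batch)
g.replay()
print(static_output.shape)
```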