I am trying to measure the time spent in model inference (forward() call on torch::jit::script::Module).
```cpp
torch::NoGradGuard no_grad;
model_outputs = torch_model.forward(input_tensors);
```
However, the measured time comes out implausibly small. I found that the reason is the default asynchronous execution of GPU operations, described in the PyTorch CUDA semantics documentation. That documentation provides a solution for Python using the torch.cuda.Event abstraction, but I could not find an analogous abstraction in C++.
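The closest I could piece together is to fall back on raw CUDA runtime events recorded on libtorch's current stream. A minimal sketch of what I mean (untested; `torch_model`, `input_tensors`, and `model_outputs` are the variables from my snippet above):

```cpp
#include <cuda_runtime.h>
#include <c10/cuda/CUDAStream.h>
#include <torch/torch.h>

cudaStream_t stream = c10::cuda::getCurrentCUDAStream().stream();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);  // mark the stream before inference
{
  torch::NoGradGuard no_grad;
  model_outputs = torch_model.forward(input_tensors);
}
cudaEventRecord(stop, stream);   // mark the stream after inference
cudaEventSynchronize(stop);      // block the host until `stop` has completed

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);  // GPU-side elapsed time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Is something like this the intended way to do it in C++, or is there a proper libtorch-level equivalent of torch.cuda.Event?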
Coming back to the example above, I believe the synchronization occurs implicitly when data is read out of model_outputs. Is this understanding correct?
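If it is correct, then I assume the alternative to relying on that implicit sync is to synchronize explicitly before taking the end timestamp. Roughly this (a sketch using the stream's synchronize() method; I have not found this documented as the recommended approach):

```cpp
#include <chrono>
#include <c10/cuda/CUDAStream.h>
#include <torch/torch.h>

auto t0 = std::chrono::steady_clock::now();
{
  torch::NoGradGuard no_grad;
  model_outputs = torch_model.forward(input_tensors);
}
// Wait for all work queued on the current stream before stopping the clock.
c10::cuda::getCurrentCUDAStream().synchronize();
auto t1 = std::chrono::steady_clock::now();

auto elapsed_us =
    std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
```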
I also tried the following solution, which captures a timestamp on the current CUDA stream once the queued inference work completes:
```cpp
cudaLaunchHostFunc(c10::cuda::getCurrentCUDAStream().stream(),
                   TimestampCaptureCallback,
                   reinterpret_cast<void*>(&compute_end_ns));
```
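A minimal sketch of the callback, assuming `compute_end_ns` is a `uint64_t` meant to hold a wall-clock timestamp in nanoseconds:

```cpp
#include <chrono>
#include <cstdint>
#include <cuda_runtime.h>

// Invoked by CUDA on a host thread once all work previously queued on the
// stream has finished; `data` points at the uint64_t to fill in.
static void CUDART_CB TimestampCaptureCallback(void* data) {
  *reinterpret_cast<uint64_t*>(data) =
      std::chrono::duration_cast<std::chrono::nanoseconds>(
          std::chrono::steady_clock::now().time_since_epoch())
          .count();
}
```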
This gives me seemingly accurate results, but is this solution generic, and will it work for all kinds of models? There is a risk in relying on internals that are not properly documented.