Timing forward call in C++ frontend using libtorch

I am trying to measure the time spent in model inference (forward() call on torch::jit::script::Module).

torch::NoGradGuard no_grad;
model_outputs = torch_model.forward(input_tensors);

However, the timing is coming out to be really small. I found that the reason is the default asynchronous execution mode for GPU operations described here:
https://pytorch.org/docs/master/notes/cuda.html#asynchronous-execution

The description provides a solution for Python using the torch.cuda.Event abstraction, but I could not find an analogous abstraction in C++.
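(For reference, the Python recipe from the linked docs maps fairly directly onto raw CUDA runtime events. A minimal sketch around the forward() call above, reusing torch_model and input_tensors, though note that cudaEventSynchronize still blocks the host:)

#include <cuda_runtime.h>
#include <ATen/cuda/CUDAContext.h>

// Time forward() with raw CUDA events recorded on the current stream.
cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
auto model_outputs = torch_model.forward(input_tensors);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);  // block the host until the stop event has been reached
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);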

Coming back to the example above, I believe the synchronization occurs when data is read out of model_outputs. Is this understanding correct?

I tried the following solution to capture the correct timestamp on c10::cuda::getCurrentCUDAStream().stream():

// Enqueue a host function that records the end timestamp once all work
// previously submitted to the current stream has completed.
cudaLaunchHostFunc(
        c10::cuda::getCurrentCUDAStream().stream(), TimestampCaptureCallback,
        reinterpret_cast<void*>(&compute_end_ns));
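(For completeness, TimestampCaptureCallback itself is not shown above; a hypothetical sketch of such a callback, assuming compute_end_ns is a uint64_t holding nanoseconds:)

#include <chrono>
#include <cstdint>
#include <cuda_runtime.h>

// Invoked on the host by the CUDA runtime once all work enqueued on the
// stream before cudaLaunchHostFunc has completed; stores the current host
// clock (in nanoseconds) into the location passed as user data.
static void CUDART_CB TimestampCaptureCallback(void* data)
{
  auto* timestamp_ns = reinterpret_cast<uint64_t*>(data);
  *timestamp_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now().time_since_epoch())
                      .count();
}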

This gives me seemingly accurate results, but is this solution generic, and will it work for all kinds of models? There is a risk in relying on internals which are not properly documented.

@ptrblck probably knows more about CUDA and might be able to help.

I’m not sure about the TimestampCaptureCallback approach, but in any case you could add manual synchronizations via:

auto stream = at::cuda::getCurrentCUDAStream(); // or at::cuda::getDefaultCUDAStream()
AT_CUDA_CHECK(cudaStreamSynchronize(stream));

to make sure your timings are correct. Alternatively you should also be able to use cudaDeviceSynchronize directly:

cudaError_t err = cudaDeviceSynchronize();
bool isEQ = err == cudaSuccess;
ASSERT_TRUE(isEQ); // e.g. a gtest assertion; replace with your own error handling

The first approach does sound good, but won’t it have an adverse effect on performance? I thought the cudaLaunchHostFunc method would be less intrusive. What are your reservations about this approach?

The second alternative will not work for us because we might have multiple models deployed and served simultaneously on the GPU, including models from TensorRT, TensorFlow, ONNX and PyTorch. Synchronizing the whole device would affect all these models negatively. See Triton Inference Server for reference.

If there are no other ways, then our best bet would be to go with the first method.

I’m not familiar with your use case and don’t know what exactly you would like to profile, but based on the docs:

Host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized.

you would need to check if it fits your use case.

If you synchronize on a custom stream or the default stream used in PyTorch, it will impact performance, but again I might not understand your use case and how you would like to time kernel executions without synchronizations/events.

I’m not familiar with your use case

I understand. I will try to simplify my use case.

I am trying to time the duration from when the model execution starts until the outputs are first ready to be read, with minimal impact on performance. I am not interested in profiling individual kernel executions. Given our discussion so far, this is a possible solution (sketched in code after the list):

1. Record start timestamp
2. model_outputs = torch_model.forward(input_tensors)
3. 
auto stream = at::cuda::getCurrentCUDAStream(); // or at::cuda::getDefaultCUDAStream()
AT_CUDA_CHECK(cudaStreamSynchronize(stream));
4. Record end timestamp
5. Read output tensor from `model_outputs`.
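Roughly in code (a sketch, reusing torch_model and input_tensors from the snippets above):

#include <chrono>
#include <torch/script.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>

torch::NoGradGuard no_grad;

auto start = std::chrono::steady_clock::now();             // 1. start timestamp
auto model_outputs = torch_model.forward(input_tensors);   // 2. enqueue the work

auto stream = at::cuda::getCurrentCUDAStream();            // 3. wait for the stream
AT_CUDA_CHECK(cudaStreamSynchronize(stream));

auto end = std::chrono::steady_clock::now();               // 4. end timestamp
auto elapsed_us =
    std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
// 5. read the output tensor from model_outputs afterwards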

The alternative I propose is to attach a single host function that just captures the end of execution on the default stream. The only assumption is that execution on all the streams being used by PyTorch is over when the default stream finishes. The description of the default stream as the one doing most of the work adds some credibility to this assumption. Hence, the warning should not apply here.
Something like below:

1. Record start timestamp
2. model_outputs = torch_model.forward(input_tensors)
3. Record end timestamp in a callback on default stream.
4. Read output tensor from `model_outputs`.
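Sketched in code (hypothetical names; RecordEndTimestamp is the same kind of host callback as TimestampCaptureCallback above, and compute_end_ns only holds a valid value once the callback has actually run, e.g. after the outputs have been copied back to the host):

#include <chrono>
#include <cstdint>
#include <cuda_runtime.h>
#include <torch/script.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>

// Host callback: store the current host clock (in nanoseconds) into the
// location passed as user data.
static void CUDART_CB RecordEndTimestamp(void* data)
{
  *reinterpret_cast<uint64_t*>(data) =
      std::chrono::duration_cast<std::chrono::nanoseconds>(
          std::chrono::steady_clock::now().time_since_epoch()).count();
}

torch::NoGradGuard no_grad;

uint64_t compute_start_ns =                                           // 1. start timestamp
    std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now().time_since_epoch()).count();

auto model_outputs = torch_model.forward(input_tensors);              // 2. enqueue the work

uint64_t compute_end_ns = 0;                                          // 3. end timestamp via callback
auto stream = at::cuda::getDefaultCUDAStream();
AT_CUDA_CHECK(cudaLaunchHostFunc(stream.stream(), RecordEndTimestamp,
                                 &compute_end_ns));

// 4. read the output tensor from model_outputs; assuming the default stream
//    is also the current stream, the callback has run by the time the
//    outputs have been copied to the host.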

Description of default stream:

The default stream is where most computation occurs when you aren’t explicitly using streams.
Link