How can I measure the time taken to initialize the CUDA context and to move a model from CPU to GPU?

I am working on an inference system built with PyTorch. CUDA context initialization always takes a noticeable amount of time, but I cannot find a tool that measures it. How can I measure it?

Also, I do not understand the following note from the PyTorch documentation:

As an exception, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary.
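
To make the question concrete, here is my current reading of that note as a small sketch. It assumes that the host-to-device copy can only overlap with host code when the source tensor is in pinned (page-locked) memory, and that an explicit torch.cuda.synchronize() is then needed before relying on the copy being finished. The function name non_blocking_copy_demo is made up for illustration, and it returns None when PyTorch or a CUDA device is not available.

```python
def non_blocking_copy_demo():
    """Copy a pinned CPU tensor to the GPU with non_blocking=True.

    Returns True if the GPU copy matches the source, or None if PyTorch
    or a CUDA device is not available in this environment.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    # Assumption: pinned memory is what allows the copy to be asynchronous
    # with respect to the host.
    src = torch.randn(1024, 1024).pin_memory()

    # With non_blocking=True this call may return before the copy completes.
    dst = src.to("cuda", non_blocking=True)

    # Wait for all queued GPU work, including the copy, before checking it.
    torch.cuda.synchronize()
    return torch.equal(dst.cpu(), src)
```

If this reading is right, a timing taken immediately after a non_blocking copy, without a synchronize, would not include the transfer itself.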

Can I use the following code to measure the time taken to load a trained model onto the GPU?

import torch

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

# ==========Start of Code==========
model = MyModel()
model.load_state_dict(...)
model.to(torch.device("cuda"))
# ==========End of Code==========

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)
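
For comparison, a host-side wall-clock version might look like the sketch below. It is only a rough approach under stated assumptions, not a verified method: it assumes the first CUDA call in a fresh process pays for context creation (a tiny allocation is used as the trigger), and that calling torch.cuda.synchronize() before stopping the clock captures any asynchronous work. The names time_call and measure_startup_and_load are made up for illustration.

```python
import time


def time_call(fn, sync=None):
    """Return (result, elapsed_ms) for fn(), measured with the host wall clock.

    If sync is given, it is called before the clock stops so that queued
    asynchronous GPU work is included in the measurement.
    """
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def measure_startup_and_load(make_model):
    """Hypothetical helper: time CUDA start-up and the CPU-to-GPU move separately."""
    import torch  # imported here so time_call stays usable without PyTorch

    # Assumption: this is the first CUDA call in the process, so it pays
    # for context creation. A tiny allocation is used as the trigger.
    _, init_ms = time_call(
        lambda: torch.zeros(1, device="cuda"), torch.cuda.synchronize
    )

    # Build the model and load its weights on the CPU, outside the timed region.
    model = make_model()

    # Time only the transfer of parameters and buffers to the GPU.
    _, load_ms = time_call(lambda: model.to("cuda"), torch.cuda.synchronize)
    return init_ms, load_ms
```

It would be called as init_ms, load_ms = measure_startup_and_load(make_model), where make_model is a function that builds MyModel and loads its state dict on the CPU. The process has to be fresh, since any earlier CUDA call would already have created the context.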