I am working on an inference system built with PyTorch. I find that CUDA context initialization is always time-consuming, but I cannot find a tool to measure it. How can I solve this problem?
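So far my best idea is a plain wall-clock timer, since context creation is host-side blocking work. A minimal sketch of what I mean (the timed_ms helper is my own name, and I am assuming torch.cuda.init() is what forces context creation on the first call):

```python
import time

def timed_ms(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock milliseconds).

    Context creation blocks the calling host thread, so a plain
    wall-clock timer should be able to capture it.
    """
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - t0) * 1000.0

# Intended use (assumes a CUDA build of PyTorch):
#   import torch
#   _, ms = timed_ms(torch.cuda.init)  # first call creates the context
#   print(f"context init took {ms:.1f} ms")
```

Is this a reasonable way to measure it, or is there a dedicated profiling tool for this?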
Also, I do not understand this passage from the PyTorch documentation:

"As an exception, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary."
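To check my understanding of that argument, I tried it on a CPU-only copy, where (as far as I can tell) it is accepted but has no effect:

```python
import torch

x = torch.ones(4)

# non_blocking=True only changes behavior for copies that can run
# asynchronously (e.g. pinned host memory -> CUDA device); for this
# CPU-to-CPU dtype conversion it is accepted and silently ignored.
y = x.to(torch.float64, non_blocking=True)

# On a machine with a GPU, I believe the asynchronous case would be:
#   pinned = x.pin_memory()
#   on_gpu = pinned.to("cuda", non_blocking=True)  # may return before copy finishes

print(y.dtype)  # torch.float64
```

Is my reading correct that non_blocking only matters for pinned-memory host-to-device (or device-to-host) copies?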
Can I use the following code to measure the time of loading a trained model onto the GPU?
import torch

# CUDA events for GPU-side timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
# ==========Start of Code==========
model = MyModel()
model.load_state_dict(...)
model.to(torch.device("cuda"))
# ==========End of Code==========
end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)