I am working on an inference system with PyTorch. I find that CUDA context initialization is always time-consuming, but I cannot find a tool to measure it. How can I solve this?
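For context, here is a rough sketch of what I have in mind: since CUDA events themselves require an initialized context, I tried timing the initialization on the host side with `time.perf_counter()` around `torch.cuda.init()`. The helper name `time_cuda_context_init` is just my own placeholder; it returns `None` when no GPU is present.

```python
import time
import torch

def time_cuda_context_init():
    """Measure CUDA context creation with a host wall-clock timer.

    CUDA events cannot time this step (they need a context to exist
    already), so time.perf_counter() on the CPU side is used instead.
    """
    if not torch.cuda.is_available():
        return None
    start = time.perf_counter()
    torch.cuda.init()          # forces context creation on the current device
    torch.cuda.synchronize()   # wait until initialization has finished
    return time.perf_counter() - start

elapsed = time_cuda_context_init()
print(f"context init took: {elapsed} s")
```

Is this a reasonable way to isolate the context-initialization cost, or is there a dedicated profiling tool for it?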
Also, I do not understand this note in the documentation:

> As an exception, several functions such as `copy_()` admit an explicit `non_blocking` argument, which lets the caller bypass synchronization when it is unnecessary.
Can I use the following code to measure the time it takes to load a trained model onto the GPU?
```python
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
# ==========Start of Code==========
model = MyModel()
model.load_state_dict(...)
model.to(torch.device("cuda"))
# ==========End of Code==========
end_event.record()

torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)
```
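My concern is that `MyModel()` and `load_state_dict()` run entirely on the CPU, so I am not sure CUDA events bracket them meaningfully. As an alternative I also tried a host-side wall-clock measurement; below is a minimal self-contained sketch where a tiny `nn.Linear` stands in for my real model and an in-memory `state_dict` stands in for a checkpoint loaded from disk (both are placeholders, not my actual code), falling back to CPU when CUDA is absent:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

start = time.perf_counter()
model = nn.Linear(16, 4)       # stand-in for MyModel()
state = model.state_dict()     # stand-in for a checkpoint read from disk
model.load_state_dict(state)
model.to(device)
if device.type == "cuda":
    torch.cuda.synchronize()   # make sure the transfer has actually finished
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"load + transfer took {elapsed_ms:.2f} ms on {device}")
```

Which of the two approaches gives the more accurate number for "time to load a model onto the GPU"?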