I didn’t mean to claim that synchronizations would speed up your code, but that they should be used if you want to profile the actual GPU execution.
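I.e. you should add a synchronization via `torch.cuda.synchronize()` before each read of a host timer, e.g. `time.perf_counter()`. CUDA operations are executed asynchronously, so without the synchronization the host timer would mostly capture the kernel launch overhead rather than the GPU execution time.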
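A minimal sketch of this pattern follows. The helper name `timed_gpu_call` and its `warmup`/`iters` parameters are hypothetical choices for illustration; the warmup loop is an assumption, added so that one-time costs such as CUDA context creation are not measured. It falls back to plain CPU timing when no GPU is available.

```python
import time

import torch


def timed_gpu_call(fn, *args, warmup=3, iters=10):
    """Hypothetical helper: average wall-clock seconds per call of fn(*args)."""
    use_cuda = torch.cuda.is_available()

    # Assumed warmup runs, so one-time setup costs are excluded from the timing.
    for _ in range(warmup):
        fn(*args)

    if use_cuda:
        torch.cuda.synchronize()  # make sure no earlier kernels are still running
    start = time.perf_counter()

    for _ in range(iters):
        fn(*args)

    if use_cuda:
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    end = time.perf_counter()

    return (end - start) / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
avg = timed_gpu_call(torch.matmul, x, x)
print(f"average time per matmul: {avg * 1e3:.3f} ms")
```

If you remove the two `torch.cuda.synchronize()` calls on a GPU, the measured time will typically drop sharply, because the loop then only enqueues the kernels without waiting for them to finish.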