I have seen messages saying you need to call torch.cuda.synchronize() at the end of an iteration to get a “correct” time. Doesn't adding this statement destroy the overlap between data transfer and compute, and hence increase the duration or reduce the throughput? That certainly would not produce the correct time, would it?
A synchronization is needed if you want to profile the workload since CUDA operations are executed asynchronously.
If you don’t synchronize you would profile the dispatching and kernel launch instead of the actual kernel execution time.
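The launch-vs-execution gap can be simulated in plain Python: submitting work to a background thread returns almost immediately, so timing only the submission measures dispatch cost, not the work itself. This is a rough analogy to CUDA's asynchronous kernel launches, not actual GPU code; blocking on the result plays the role of torch.cuda.synchronize().

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work():
    # Stand-in for a kernel: ~50 ms of wall-clock "work".
    time.sleep(0.05)

pool = ThreadPoolExecutor(max_workers=1)

# Timing only the submission (the "launch"): returns almost immediately.
t0 = time.perf_counter()
future = pool.submit(work)
launch_time = time.perf_counter() - t0

# Timing until the result is ready (the "synchronize"): sees the real cost.
t0 = time.perf_counter()
future = pool.submit(work)
future.result()  # blocks until the work finishes, like torch.cuda.synchronize()
synced_time = time.perf_counter() - t0

print(f"launch-only: {launch_time * 1e3:.2f} ms, synced: {synced_time * 1e3:.2f} ms")
pool.shutdown()
```

The launch-only number will be orders of magnitude smaller than the synced one, which is exactly the mistake an unsynchronized timer makes on the GPU.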
You are right that synchronizations slow down your code and should be generally avoided unless you want to profile the GPU workload.
@ptrblck, thanks for confirming my suspicion. In that case, what is the most accurate way to measure throughput (images/sec, samples/sec, etc.) with time.time(), per epoch or per 100 mini-batches, without handicapping the measurement with synchronize() calls?
Synchronize the code before starting and stopping the timers, execute the code for N iterations, and once done calculate the average.
So something like the following:
torch.cuda.synchronize()
start_timer
# run N iters
torch.cuda.synchronize()
stop_timer
Also, I think when N is sufficiently large, the synchronization is not really needed; the measured number is likely close to the maximum attainable rate.
The start timer should be started after the first synchronization.
Makes sense, so we don't include the synchronize() delay of previous ops. If N is 25+, I think we can do away with synchronize() altogether.