Cons of using torch.cuda.synchronize()

whatdhack · April 1, 2022, 6:49am

I see messages about needing to call torch.cuda.synchronize() at the end of an iteration to get the “correct” time. Doesn't adding this statement destroy the overlap between data transfer and compute, and hence increase duration or reduce throughput? That certainly would not produce the correct time, would it?

ptrblck · April 1, 2022, 8:58am

A synchronization is needed if you want to profile the workload since CUDA operations are executed asynchronously.
If you don’t synchronize you would profile the dispatching and kernel launch instead of the actual kernel execution time.

You are right that synchronizations slow down your code and should generally be avoided unless you want to profile the GPU workload.
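
To illustrate the point about asynchronous execution, here is a minimal sketch that times the same matrix multiplications with and without a final synchronization. The helper name `time_matmul` and its parameters are made up for this example, and it returns `None` when PyTorch or a CUDA device is not available. On a GPU, the unsynchronized number would typically reflect only how long it took to queue the kernels.

```python
import time

try:
    import torch
except ImportError:  # torch is optional so the sketch still imports without it
    torch = None


def time_matmul(synchronize, size=4096, repeats=10):
    """Return wall-clock seconds for `repeats` matmuls, or None without CUDA.

    `time_matmul` is a hypothetical helper for illustration only.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    x = torch.randn(size, size, device="cuda")
    x @ x                      # warm-up launch
    torch.cuda.synchronize()   # make sure setup work has finished
    start = time.perf_counter()
    for _ in range(repeats):
        y = x @ x              # queued asynchronously on the GPU
    if synchronize:
        torch.cuda.synchronize()  # wait until the kernels have actually run
    return time.perf_counter() - start


if __name__ == "__main__":
    print("without final sync:", time_matmul(False))
    print("with final sync:   ", time_matmul(True))
```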

whatdhack · April 1, 2022, 2:21pm

@ptrblck, thanks for confirming my suspicion. In that case, what is the most accurate way to measure throughput (images/sec, samples/sec, etc.) with time.time(), per epoch or per 100 mini-batches, without handicapping the measurement with synchronize()?

ptrblck · April 1, 2022, 8:13pm

Synchronize the code before starting and stopping the timers and execute the code for N iterations. Once done, calculate the average.

whatdhack · April 1, 2022, 10:02pm

So something like the following:

start_timer
torch.cuda.synchronize()
# run N iters
torch.cuda.synchronize()
stop_timer

Also, I think that when N is sufficiently large, synchronization is not really needed. The resulting number is likely closer to the maximum attainable rate.

ptrblck · April 2, 2022, 1:47am

The start timer should be started after the first synchronization.
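
A minimal sketch of that ordering (synchronize, start the timer, run N iterations, synchronize, stop the timer) might look like the following. The function `measure_throughput` and its parameters `step`, `sync`, and `warmup` are hypothetical names chosen for this example. The synchronization function is passed in as an argument so the same helper also runs on a machine without a GPU.

```python
import time


def measure_throughput(step, n_iters, batch_size, sync=lambda: None, warmup=3):
    """Return samples/sec averaged over `n_iters` calls to `step`.

    `step` runs one iteration; `sync` blocks until queued device work is done
    (for CUDA, pass torch.cuda.synchronize). Both names are illustrative.
    """
    for _ in range(warmup):
        step()                 # untimed warm-up iterations
    sync()                     # drain earlier work BEFORE starting the timer
    start = time.perf_counter()
    for _ in range(n_iters):
        step()                 # no per-iteration sync, so overlap is preserved
    sync()                     # wait for the last queued work before stopping
    elapsed = time.perf_counter() - start
    return n_iters * batch_size / elapsed


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        x = torch.randn(64, 1024, device="cuda")
        w = torch.randn(1024, 1024, device="cuda")
        rate = measure_throughput(lambda: x @ w, n_iters=100, batch_size=64,
                                  sync=torch.cuda.synchronize)
        print(f"{rate:.1f} samples/sec")
```

Only two synchronizations happen per measurement, so their cost is spread over all N iterations instead of being paid on every one.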

whatdhack · April 2, 2022, 6:41am

Makes sense not to include the synchronize() delay of previous ops. If N is 25+, I think we can do away with the synchronize altogether.