Say I have some code that times some CUDA calls like this:
import torch
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
#
# some GPU work
#
end.record()
torch.cuda.synchronize() # <--- what if I remove this?
print(start.elapsed_time(end))
What will happen if I remove the explicit synchronization? Will elapsed_time actually block until both events have completed and still report the correct time, or will it report something unexpected?
CUDA events are of type cudaEvent_t and are created and destroyed with cudaEventCreate() and cudaEventDestroy(). In the above code, cudaEventRecord() places the start and end events into the default stream, stream 0. The device records a timestamp for an event when it reaches that event in the stream. cudaEventSynchronize() blocks CPU execution until the specified event has been recorded. cudaEventElapsedTime() returns, in its first argument, the number of milliseconds elapsed between the recording of start and end. This value has a resolution of approximately half a microsecond.
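To make that mapping concrete, here is a minimal PyTorch sketch of the same lifecycle; the matmul and tensor size are just placeholder work, and end.synchronize() plays the role of cudaEventSynchronize():

import torch

start = torch.cuda.Event(enable_timing=True)  # enable_timing=True is required for elapsed_time()
end = torch.cuda.Event(enable_timing=True)

x = torch.randn(4096, 4096, device="cuda")  # placeholder workload

start.record()   # like cudaEventRecord: enqueue the event on the current stream
y = x @ x        # the GPU work being timed
end.record()

end.synchronize()  # like cudaEventSynchronize: block the CPU until `end` has been recorded
print(start.elapsed_time(end))  # like cudaEventElapsedTime: milliseconds between the two events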
You wouldn’t strictly need to synchronize the entire device; you can also call end.synchronize() (or synchronize the stream). However, in minimal code snippets you might not care whether you sync the whole device.
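As a sketch of those alternatives (any one of the following is enough before reading the timing, assuming the work and both record() calls happened on the current stream):

torch.cuda.synchronize()                   # wait for all outstanding work on the device
# or
end.synchronize()                          # wait only until the `end` event has completed
# or
torch.cuda.current_stream().synchronize()  # wait for the stream the events were recorded on

print(start.elapsed_time(end))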
If you don’t call any sync, the code will properly fail with an error: