I am currently using torch.cuda.Event(enable_timing=True)
to measure the forward, backward, and total iteration time of DNN training. This data is used to predict the iteration time of a DNN training setup. I noticed that when I enable this profiling, the timings are a little longer than expected, and consequently the prediction comes out higher than my baseline experiments. I derived the expected values from a TensorBoard profiling report.
I am using the following context manager to measure the time of a code chunk:
import contextlib
from typing import Iterator

import torch

@contextlib.contextmanager
def take_gpu_time() -> Iterator["DurationRecord"]:
    # Both events need enable_timing=True so elapsed_time() can be queried later.
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    record = DurationRecord(start_event, end_event)
    start_event.record()  # enqueued on the current CUDA stream when the block is entered
    yield record
    end_event.record()    # enqueued when the with-block exits
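DurationRecord itself is just a small holder for the two events (simplified here to the parts relevant to this question):

class DurationRecord:
    def __init__(self, start_event: torch.cuda.Event, end_event: torch.cuda.Event) -> None:
        # Keep both events so the elapsed time can be queried later.
        self.start_event = start_event
        self.end_event = end_event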
The getter of DurationRecord.duration
looks as follows:
    @property
    def duration(self) -> float:
        # Block until all queued GPU work (and both events) has completed,
        # then return the elapsed time between the events in milliseconds.
        torch.cuda.synchronize()
        self.start_event.synchronize()
        self.end_event.synchronize()
        return self.start_event.elapsed_time(self.end_event)
With this asynchronous design, I hope to reduce the performance impact by deferring the device synchronization to the point in time when I actually need the result (i.e., after the iteration has completed).
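Schematically, the records are produced and read back roughly like this (the toy model, optimizer, and random data below are just placeholders for my actual training setup):

device = torch.device("cuda")
model = torch.nn.Linear(128, 10).to(device)      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

forward_ms, backward_ms = [], []
for _ in range(10):                               # stands in for the real data loader
    batch = torch.randn(32, 128, device=device)
    target = torch.randint(0, 10, (32,), device=device)

    with take_gpu_time() as fwd_record:           # times the forward pass
        output = model(batch)
        loss = loss_fn(output, target)
    with take_gpu_time() as bwd_record:           # times the backward pass
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # The durations are only read here, after the iteration has completed,
    # so the synchronization in the getter happens once per iteration.
    forward_ms.append(fwd_record.duration)
    backward_ms.append(bwd_record.duration)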
Could this code have a notable impact on the performance of DNN training, especially when it’s called often (i.e., during each forward and backward call)?