Does torch.cuda.Event influence performance?

I am currently using torch.cuda.Event(enable_timing=True) to measure the forward, backward, and total iteration time of DNN training. This data is used to predict the iteration time of a DNN training setup. I noticed that when I enable this timing, the measured times are slightly longer than expected, and consequently the prediction is higher than my baseline experiments. I derived the expected values from a TensorBoard profiling report.

I am using the following context manager to measure the time of a code chunk:

import contextlib
from typing import Iterator

import torch

@contextlib.contextmanager
def take_gpu_time() -> Iterator["DurationRecord"]:
    # Events created with enable_timing=True can later report elapsed GPU time.
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    record = DurationRecord(start_event, end_event)
    start_event.record()
    try:
        yield record
    finally:
        # Recorded asynchronously; no synchronization happens here.
        end_event.record()
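For completeness, DurationRecord is essentially a small container for the two events, with the duration property shown below; a minimal sketch of how I define it:

```python
import dataclasses

import torch


@dataclasses.dataclass
class DurationRecord:
    # The two timing events bracketing the measured code chunk.
    start_event: torch.cuda.Event
    end_event: torch.cuda.Event
```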

The getter of DurationRecord.duration looks as follows:

@property
def duration(self) -> float:
    # Full device sync (possibly redundant given the event syncs below).
    torch.cuda.synchronize()
    # Block until both events have actually completed on the GPU.
    self.start_event.synchronize()
    self.end_event.synchronize()
    # elapsed_time returns milliseconds.
    return self.start_event.elapsed_time(self.end_event)

With this asynchronous design, I hope to reduce the performance impact by deferring device synchronization to the point where I actually need the result (i.e., after the iteration has completed).
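In case it helps to reproduce the setup, here is a self-contained toy version combining the pieces above. The tiny linear model and tensor sizes are placeholders for my real workload, and the demo is skipped when no GPU is present:

```python
import contextlib
from typing import Iterator

import torch


class DurationRecord:
    """Holds a start/end event pair; the duration is read lazily."""

    def __init__(self, start_event, end_event):
        self.start_event = start_event
        self.end_event = end_event

    @property
    def duration(self) -> float:
        # Synchronize only when the result is actually needed.
        self.start_event.synchronize()
        self.end_event.synchronize()
        return self.start_event.elapsed_time(self.end_event)  # milliseconds


@contextlib.contextmanager
def take_gpu_time() -> Iterator[DurationRecord]:
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    record = DurationRecord(start_event, end_event)
    start_event.record()
    try:
        yield record
    finally:
        end_event.record()


if torch.cuda.is_available():
    model = torch.nn.Linear(256, 256).cuda()          # placeholder model
    x = torch.randn(64, 256, device="cuda")
    records = []
    for _ in range(3):
        with take_gpu_time() as rec:                  # measure one step
            y = model(x)
            y.sum().backward()
        records.append(rec)
    # Durations are read only after the loop, deferring synchronization.
    print([f"{r.duration:.3f} ms" for r in records])
else:
    print("CUDA not available; skipping the timing demo")
```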

Could this code have a notable impact on the performance of DNN training, especially when it is called often (i.e., during each forward and backward pass)?