I am currently running some experiments in which I measure data loading times and compute times (i.e., forward/backward pass times) for a couple of models, starting off with ResNet-50 on ImageNet. However, I sometimes encounter negative time measurements, and I would like to understand whether this issue is related to PyTorch, to core Python components, or to something else in my code base.
My code is based on the deep learning examples from NVIDIA, which in turn are based on an example from the PyTorch community. The core of my time measurement looks like this:
end = time.time()  # set once before the loop (omitted from the snippet above)
for i, (data, label) in data_iter:
    batch_size = data.size(0)
    lr_policy(optimizer, i, epoch)
    data_time = time.time() - end           # <<< time spent fetching the batch

    optimizer_step = ((i + 1) % batch_size_multiplier) == 0
    # Forward/backward happens in step()
    loss = step(data, label, optimizer_step=optimizer_step)
    iteration_time = time.time() - end      # <<< data loading + compute together

    if logger is not None:
        pass  # <some logging stuff>

    end = time.time()                       # <<< reset the reference point
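For reference, the measurement pattern in isolation looks roughly like the sketch below. Here simulate_work is just a hypothetical stand-in for data loading and for the forward/backward pass (it is not part of my actual code base), and I take compute_time to be iteration_time minus data_time, which is how the logged value is derived:

import time

def simulate_work(seconds=0.01):
    # Stand-in workload; replaces data loading / forward-backward.
    time.sleep(seconds)

end = time.time()
for i in range(5):
    simulate_work()                          # stands in for fetching the next batch
    data_time = time.time() - end            # time since the previous iteration ended
    simulate_work(0.05)                      # stands in for forward/backward in step()
    iteration_time = time.time() - end       # data loading + compute together
    compute_time = iteration_time - data_time
    print(f"iter {i}: data {data_time:.5f} s, compute {compute_time:.5f} s")
    end = time.time()                        # reset the reference point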
My logs then show measurements like this:
Train [1/1492] t.data_time : 0.00018 s t.compute_time : 0.26296 s
Train [1/1493] t.data_time : 0.00018 s t.compute_time : 0.26300 s
Train [1/1494] t.data_time : 0.00018 s t.compute_time : 0.26295 s
Train [1/1495] t.data_time : 0.00018 s t.compute_time : 0.26285 s
Train [1/1496] t.data_time : 0.00018 s t.compute_time : -2.66764 s
Train [1/1497] t.data_time : 0.00018 s t.compute_time : 0.26303 s
Train [1/1498] t.data_time : 0.00017 s t.compute_time : 0.26280 s
Train [1/1499] t.data_time : 0.00017 s t.compute_time : 0.26296 s
I would greatly appreciate some insight into where these negative measurements could come from. If it helps, I can also provide more information on my experiments and code base.