Loss computation time constantly increases

My current implementation uses PyTorch 0.4 with Python 3.6, and I am facing the issue that the computation time of the loss (nn.MSELoss) constantly increases if I enable pin_memory in the DataLoader.

I.e., the first batch requires 120 ms of computation time, the second requires 270 ms, and all remaining batches require 630 ms.
After num_workers batches have been processed, the computation time drops back to 120 ms and then increases in the same manner again.

Besides enabling pinned memory in the DataLoader, I manually pin the memory of all tensors contained in the batch and move them to the GPU as follows:

# pin the host memory of each tensor and copy it to the GPU asynchronously
gt = sampled_batch['gt'].pin_memory().cuda(device=self.cuda_device, non_blocking=True)
mask = sampled_batch['mask'].pin_memory().cuda(device=self.cuda_device, non_blocking=True)
input = sampled_batch['image'].pin_memory().cuda(device=self.cuda_device, non_blocking=True)
input.requires_grad_()

For testing purposes, I simply calculate the loss as follows and measure the time:

start = time.time()
loss = self.criterion(estimate, gt) 
print("Calculation time %f" % (time.time() - start))

Disabling pinned memory solves the issue for the loss calculation, but with the downside of increasing the time required to copy each batch to the GPU in every iteration.

What am I missing here?

Thanks in advance

If your code runs on the GPU, you should call torch.cuda.synchronize() before stopping the timer.
Could you add this line of code and test it again?
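
Something like this, as a minimal sketch (criterion, estimate and gt stand in for your own objects):

torch.cuda.synchronize()  # make sure all previously queued GPU work has finished
start = time.time()
loss = criterion(estimate, gt)  # the operation you actually want to measure
torch.cuda.synchronize()  # wait for the loss kernel to finish before stopping the timer
print("Calculation time %f" % (time.time() - start))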

Thanks for your reply. Now the time required for the calculation of the loss is constant.
But of course, synchronization is now consuming a lot of time.

Why is it the case that the time required for synchronization increases until it ‘saturates’ and drops again after num_workers batches have been processed?

If I set non_blocking=False in the cuda calls, the time required for synchronization is constant, but then it takes longer to move each batch to the GPU. Is there any way to speed up the overall processing, or is synchronization a bottleneck I cannot avoid?

You don’t need to synchronize your code for training.
It is only necessary if you would like to time your code.
CUDA calls are asynchronous, i.e. they run in the background while your Python code already executes the next line, which in your snippet is the one stopping the timer.
That is why you need to wait for the CUDA operation to finish before you take the time.

In your live training code, you should remove the sync.
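
If you still want to time single operations during training, CUDA events are an alternative; a small sketch under the same assumptions (criterion, estimate and gt are placeholders for your own objects):

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()  # recorded on the current CUDA stream
loss = criterion(estimate, gt)
end_evt.record()

torch.cuda.synchronize()  # still needed once before reading the elapsed time
print("Calculation time %f ms" % start_evt.elapsed_time(end_evt))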

Yes, I am aware of this. Nevertheless, the effective training time does not change, and there does not seem to be a way of improving the training speed.
Anyway, now I understand what is causing the bottleneck.
Thanks for your help