I'm noticing strange behavior during training.
With a small batch size (e.g. N = 1, 2, or 3) and 8 workers (or more),
the training loop pauses for a few seconds every 8th iteration, as if that pause were being used to load data or something of the sort.
This behavior appears on a GTX 1080 Ti GPU.
The same code executed on a TITAN RTX does not exhibit it.
Has anyone else encountered this issue? Could it be a problem with my machine?
for epoch in range(1 + epoch, NUM_EPOCHS + 1):
    for tensors in train_loader:
        input_image = tensors['input_raw']
        gt_image = tensors['target_rgb']
        iteration += 1
        batch_size = gt_image.size(0)
        input_image = input_image.to(device)
        gt_image = gt_image.to(device)
        prediction_tensor, features = network(input_image)
        loss_target_second_tensor = l1_loss(prediction_tensor, gt_image)
        loss = loss_target_second_tensor
This does indeed sound like a data loading bottleneck, similar to this one, so you could profile your code to isolate the bottleneck further.
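A minimal way to make the pattern visible is to time each iteration: with a data loading bottleneck, a spike shows up every num_workers-th iteration, when all workers run out of prefetched batches at once. The sketch below is torch-free; `fake_loader` and its delay are made up purely to illustrate what the timings would look like — point `iteration_times` at your real `train_loader` and training step instead.

```python
import time

def iteration_times(iterable, work=lambda batch: None):
    """Measure the wall-clock time of each loop iteration.

    A data loading stall shows up as a spike every num_workers-th
    iteration: the workers all hand over their prefetched batches,
    then the loop waits for the next round to be produced.
    """
    times = []
    end = time.perf_counter()
    for batch in iterable:
        work(batch)  # stands in for the forward/backward pass
        now = time.perf_counter()
        times.append(now - end)
        end = now
    return times

def fake_loader(n=16, slow_every=8, delay=0.05):
    """Hypothetical loader that stalls on every `slow_every`-th batch,
    mimicking 8 workers that all finish prefetching at the same time."""
    for i in range(n):
        if i % slow_every == slow_every - 1:
            time.sleep(delay)
        yield i

times = iteration_times(fake_loader())
# entries 7 and 15 (the 8th and 16th iterations) stand out as the stall
```

If your real per-iteration timings show the same periodic spikes, the loader is the bottleneck rather than the GPU.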
As a quick test, you could replace the data loading with a single random tensor and let the model train on it. If the periodic slowdown disappears, that points towards the data loading.
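A minimal sketch of that synthetic-data test: keep the training step but reuse one fixed random batch every iteration, so the DataLoader and its workers are out of the picture entirely. The tiny `Conv2d` model, tensor shapes, and optimizer here are placeholders — substitute your own `network`, input/target shapes, and training step.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tiny stand-in model; replace with your own network from the question.
network = nn.Conv2d(3, 3, kernel_size=3, padding=1).to(device)
optimizer = torch.optim.Adam(network.parameters(), lr=1e-4)
l1_loss = nn.L1Loss()

# One fixed random batch, created once and reused -- no DataLoader involved.
input_image = torch.randn(2, 3, 64, 64, device=device)
gt_image = torch.randn(2, 3, 64, 64, device=device)

for iteration in range(20):
    optimizer.zero_grad()
    prediction_tensor = network(input_image)
    loss = l1_loss(prediction_tensor, gt_image)
    loss.backward()
    optimizer.step()
```

If this loop runs at a steady speed with no pause every 8th iteration, the periodic stall in your original run was caused by data loading, not by the model or the GPU.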