Here’s a graph:
- green = each time a batch is produced by one of the 8 workers
- blue = each time the GPU starts processing a batch
- orange = each time the GPU finishes processing a batch
We can tell there’s no data loader starvation here because as soon as an orange dot appears (the GPU has finished processing a batch), a blue dot follows almost immediately (the GPU is starting to process the next batch).
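To make that check concrete, here is a minimal sketch of how the gaps between orange and blue dots could be computed from logged timestamps. The function names, the threshold, and the sample numbers are made up for illustration; they are not my actual measurements.

```python
def idle_gaps(finish_times, start_times):
    """Idle time between the GPU finishing batch i and starting batch i + 1.

    finish_times[i]: when the GPU finished batch i (orange dot)
    start_times[i]:  when the GPU started batch i (blue dot)
    """
    return [
        start_times[i + 1] - finish_times[i]
        for i in range(len(finish_times) - 1)
    ]


def starved_batches(finish_times, start_times, threshold):
    """Indices of batches the GPU had to wait longer than `threshold` for."""
    gaps = idle_gaps(finish_times, start_times)
    return [i + 1 for i, gap in enumerate(gaps) if gap > threshold]


# Illustrative timestamps in seconds (hypothetical, not measured).
start_times = [0.0, 1.0, 2.0, 4.5]
finish_times = [1.0, 2.0, 3.0, 5.5]

print(idle_gaps(finish_times, start_times))
print(starved_batches(finish_times, start_times, threshold=0.1))
```

With no starvation every gap stays near zero; a batch index shows up in `starved_batches` only when the GPU sat idle waiting for it.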
Now, here’s a graph with a slower dataloader:
Here we can see data starvation: sometimes there is a long gap between an orange dot and the next blue dot, meaning the GPU sits idle before it starts the next batch.
It is clear that the issue is data loader starvation: the fast data loader is identical to the slow one, except that it generates random data instead of performing expensive augmentations. What I don’t understand is why the starvation happens.
For example, there is clearly a long gap between the GPU finishing batch 45 and starting batch 46. But batch 46 is generated long before it is actually consumed, so I would expect it to be consumed immediately after batch 45.
I’m not sure what could be causing such a long delay between batches 45 and 46. (I’m only using these two batches as an example; the same pattern shows up elsewhere.)
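To quantify "generated long before it is consumed", something like the sketch below could be used. It assumes the green and blue timestamps are stored per batch index; the function name and the numbers are hypothetical and only show the shape of the calculation.

```python
def queue_wait(produced_at, consumed_at):
    """Seconds each batch waited between being produced and being consumed.

    produced_at[idx]: when a worker produced batch idx (green dot)
    consumed_at[idx]: when the GPU started batch idx (blue dot)
    """
    return {
        idx: consumed_at[idx] - produced_at[idx]
        for idx in consumed_at
        if idx in produced_at
    }


# Illustrative timestamps in seconds (hypothetical, not measured).
produced_at = {45: 10.0, 46: 10.5}
consumed_at = {45: 12.0, 46: 15.0}

waits = queue_wait(produced_at, consumed_at)
print(waits)
```

A large wait for a batch that was ready early (like batch 46 in my plot) is exactly what I can’t explain.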
Does anyone know what could be going on here?