DataLoader sometimes returns batches of a different size when using IterableDataset and multiple workers

Hi,

I have an issue with DataLoaders that I have never seen before. It only occurs when using an IterableDataset and multiple DataLoader workers. What I have observed is that when the batch size should be, e.g., 72, some of the batches produced by the DataLoaders are smaller. For example, I get the following batch sizes over time (the scenario is DistributedDataParallel training with multiple workers per DataLoader):

Device 0: 72 72 72 72 … 20 72 … 4 72 72 …
Device 1: 72 72 27 72 … 72 72 18 72 72 …

Etc.

So this is not about a trailing batch with a smaller size; seemingly at random, some of the batches produced by the DataLoaders are smaller than expected!

My question is: is this known behaviour of DataLoaders that use IterableDatasets and multiple workers?
How could something like this be caused by my IterableDataset implementation?
Could this be a bug in the DataLoader code?

I’m tearing my hair out over this one …

Did you try using drop_last=True? Do you know if this happens when the IterableDataset indicates that it has no more data?
The second note in the DataLoader documentation on batching seems to imply that that might help.
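Just as a sketch of what I mean (with a placeholder `dataset` and a made-up worker count):

```python
from torch.utils.data import DataLoader

# drop_last=True discards any batch that ends up smaller than batch_size
loader = DataLoader(dataset, batch_size=72, num_workers=4, drop_last=True)
```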

Best regards

Thomas

Thanks @tom. As I said, this does not happen at the last batch, so drop_last=True would probably not help. Further, I do log when the IterableDataset runs out of samples, and that is not the case here; if it were, the DataLoader would also not be able to create a subsequent batch with the full batch size.

Please see the batch size pattern over time that I shared; the batch size sometimes drops to a lower number … but continues with the normal batch size afterwards. It’s really odd; this should not happen with DataLoaders, right?

Hi @tom, I investigated this further and found that setting drop_last=True worked for me. Initially I didn’t expect this to work because I had a different mental model of how the DataLoader uses the workers to create batches. I was under the impression that the workers would provide individual samples, and that the DataLoader would subsequently package these samples into batches. In this model only the very last batch can be incomplete.

However, apparently the workers provide complete batches, and thus when a worker’s IterableDataset reaches its end, that worker’s trailing batch can have fewer samples.

These incomplete batches can appear anywhere in the stream of batches because if the dataset for one worker is depleted the other workers simply continue to produce batches.
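For anyone who finds this later, here is a minimal sketch that reproduces what I’m seeing; the per-worker lengths are made up just so that the workers run out of data at different times:

```python
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ToyStream(IterableDataset):
    """Toy stream in which each worker deliberately yields a different number of samples."""

    def __init__(self, lengths=(50, 90, 130, 170)):
        self.lengths = lengths

    def __iter__(self):
        info = get_worker_info()
        # In a worker process, pick that worker's length so the workers
        # finish at different points in time.
        n = self.lengths[0] if info is None else self.lengths[info.id]
        return iter(range(n))

loader = DataLoader(ToyStream(), batch_size=16, num_workers=4)
print([len(b) for b in loader])
# e.g. [16, 16, 16, ..., 2, 16, ..., 10, 16, ...]
# Each worker assembles its own batches, so each worker's trailing (short)
# batch shows up somewhere in the middle of the overall stream.
```

With drop_last=True on the same loader, each worker’s short trailing batch is dropped and only full batches come out, which matches what I now see in my training run.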

Is this a correct description of what happens?


This is approximately what I had in mind, yes; in particular, that the workers provide complete batches.
To my mind, this aspect of IterableDataset’s behaviour, which is more a result of technical constraints than of the ideal “intuitive” behaviour, is much of the attraction of classic map-style datasets, which seem to be closer to what people expect.
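For contrast, a quick sketch with a map-style dataset (the sizes are made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(103))  # 103 samples, map-style
loader = DataLoader(ds, batch_size=16, num_workers=4)
print([len(b[0]) for b in loader])
# [16, 16, 16, 16, 16, 16, 7]
# The batch indices come from a single sampler in the main process, so only
# the very last batch can be short, regardless of the number of workers.
```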

Best regards

Thomas

P.S.: Glad it’s solved. 🙂
