I have multiple training runs, all with DataLoaders accessing the same underlying Dataset on disk, and every training run is moving excruciatingly slowly (one CIFAR10 validation epoch takes ~10 minutes). Each DataLoader has 48 workers, in case that matters. I’m wondering whether the training runs are potentially blocking one another, slowing down training. Is this possibly the cause of the slow progress? If so, what should I do about it?
@ptrblck would you happen to know the answer?
The torchvision CIFAR10 dataset loads the data into memory (as it’s quite small), so I wouldn’t expect to see a huge speedup from such a large number of workers. Generally, 48 workers for each DataLoader sounds quite excessive, so I would recommend experimenting with this value to check whether the workers themselves are creating the slowdown. This post is also a very good reference when it comes to data loading bottlenecks.
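One rough way to experiment with the value is to time a single pass over the data for a few different `num_workers` settings. The snippet below is only a sketch under some assumptions: `time_one_pass` and `make_fake_cifar` are hypothetical helper names, the dataset is random in-memory tensors with CIFAR10-like shapes rather than the real data, and the batch size and worker counts are arbitrary.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_pass(dataset, num_workers, batch_size=256):
    """Return the seconds needed to iterate once over `dataset`."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass  # only data loading is measured; no model is involved
    return time.perf_counter() - start


def make_fake_cifar(n=2048):
    # Assumption: random tensors shaped like CIFAR10 samples, kept in memory.
    images = torch.randn(n, 3, 32, 32)
    labels = torch.randint(0, 10, (n,))
    return TensorDataset(images, labels)


if __name__ == "__main__":
    dataset = make_fake_cifar()
    for workers in (0, 2, 4, 8):
        elapsed = time_one_pass(dataset, workers)
        print(f"num_workers={workers}: {elapsed:.2f}s")
```

The measured time includes worker startup, so for a small in-memory dataset the higher worker counts may well come out slower. Swapping in your real dataset and running this while your other training runs are active should give a more representative picture.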
How could multiple workers create a slowdown?
From the linked post:
Beyond an optimal number (experiment!), throwing more worker processes at the IOPS barrier WILL NOT HELP, it’ll make it worse. You’ll have more processes trying to read files at the same time, and you’ll be increasing the shared memory consumption by significant amounts for additional queuing, thus increasing the paging load on the system and possibly taking you into thrashing territory that the system may never recover from
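Since several runs share one machine, it can also help to look at the total number of worker processes across all runs rather than per DataLoader. The sketch below is a hypothetical heuristic, not an established rule: `workers_per_run` is a made-up helper that reserves one core per run for the main training process and splits the remaining cores evenly. Treat the result as a starting point for experimentation.

```python
import os


def workers_per_run(num_runs, cpu_count=None, reserved_cores=1):
    """Hypothetical heuristic: split the available cores evenly across runs.

    `reserved_cores` is the number of cores set aside per run for the main
    training process (an assumption, tune as needed).
    """
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    usable = max(cpu_count - reserved_cores * num_runs, 0)
    return usable // num_runs


if __name__ == "__main__":
    # Assumption: 4 concurrent runs; the original post does not give the number.
    print(workers_per_run(num_runs=4))
```

For example, with 4 concurrent runs on a 48-core machine this suggests 11 workers per DataLoader, whereas 48 workers per run would mean 192 worker processes competing for the same cores and disk.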