Can multiple training runs, all reading the same data on disk, slow each other down?

RylanSchaeffer · December 13, 2021, 5:36am

I have multiple training runs, all with DataLoaders accessing the same underlying Dataset on disk, and every training run is moving excruciatingly slowly (one CIFAR10 validation epoch takes ~10 minutes). Each DataLoader has 48 workers, in case that matters. I’m wondering whether the training runs are potentially blocking one another, slowing down training. Is this possibly the cause of the slow progress? If so, what should I do about it?

RylanSchaeffer · December 13, 2021, 5:37am

@ptrblck would you happen to know the answer?

ptrblck · December 13, 2021, 9:22pm

The torchvision CIFAR10 dataset loads all of its data into memory (it’s quite small), so I wouldn’t expect a large speedup from this many workers. In general, 48 workers per DataLoader sounds excessive, so I would recommend experimenting with this value to check whether it is causing the slowdown. This post is also a very good reference when it comes to data loading bottlenecks.
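One way to experiment with the worker count is to time a single pass over the loader for a few candidate values. The sketch below is only an illustration: `time_one_pass`, `sweep_workers`, and `make_synthetic_loader` are hypothetical helper names, and the synthetic loader uses random CIFAR10-shaped tensors as a stand-in for the real dataset (an assumption, so that nothing needs to be downloaded).

```python
import time
from typing import Callable, Dict, Iterable, Sequence


def time_one_pass(make_loader: Callable[[int], Iterable], num_workers: int) -> float:
    """Return the wall-clock seconds needed to iterate once over the loader."""
    loader = make_loader(num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass  # only data loading is measured; no model is involved
    return time.perf_counter() - start


def sweep_workers(make_loader: Callable[[int], Iterable],
                  candidates: Sequence[int]) -> Dict[int, float]:
    """Time one pass for each candidate worker count."""
    return {n: time_one_pass(make_loader, n) for n in candidates}


def make_synthetic_loader(num_workers: int):
    """Hypothetical stand-in for a CIFAR10 validation loader (random tensors)."""
    # Imported lazily so the timing helpers above can be used without PyTorch.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    images = torch.randn(10_000, 3, 32, 32)   # CIFAR10-shaped images
    labels = torch.randint(0, 10, (10_000,))  # 10 classes
    dataset = TensorDataset(images, labels)
    return DataLoader(dataset, batch_size=256, num_workers=num_workers)
```

Calling `sweep_workers(make_synthetic_loader, [0, 2, 4, 8])` returns a dict mapping each worker count to its elapsed seconds. On platforms that start workers with spawn, run it from under an `if __name__ == "__main__":` guard. Timings taken while the other training runs are active will reflect the contention between them, which is the situation being diagnosed here.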

RylanSchaeffer · December 13, 2021, 10:00pm

How could multiple workers create a slowdown?

ptrblck · December 13, 2021, 11:13pm

From the linked post:

Beyond an optimal number (experiment!), throwing more worker processes at the IOPS barrier WILL NOT HELP, it’ll make it worse. You’ll have more processes trying to read files at the same time, and you’ll be increasing the shared memory consumption by significant amounts for additional queuing, thus increasing the paging load on the system and possibly taking you into thrashing territory that the system may never recover from.
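When several runs share one machine, the worker processes of all runs compete for the same cores and disk. A rough starting point for the experiment could be to split the cores across runs, as in the sketch below. The helper `workers_per_run` and the rule it applies (one core kept for each run's main process, the rest divided evenly) are assumptions for illustration, not a recommendation from the quoted post; the best value still has to be found by measuring.

```python
import os
from typing import Optional


def workers_per_run(concurrent_runs: int, cpu_count: Optional[int] = None) -> int:
    """Heuristic starting value for num_workers when runs share one machine.

    Assumption: each run keeps one core for its main process and the
    remaining cores are divided evenly among the runs.
    """
    if concurrent_runs < 1:
        raise ValueError("concurrent_runs must be at least 1")
    total = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    usable = max(total - concurrent_runs, 0)  # cores left after the main processes
    return usable // concurrent_runs          # 0 means: load in the main process


# Example: 4 concurrent runs on a 48-core machine.
print(workers_per_run(4, cpu_count=48))  # prints 11
```

A result of 0 corresponds to `num_workers=0`, where the DataLoader loads batches in the main process without spawning workers.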