Will the data loading time always increase when increasing the number of DataLoader workers?

The end of this thread covers it pretty well, including some measurements of a specific scenario by @michaelklachko: How to prefetch data when processing with GPU?

TL;DR: my rule of thumb is to keep the total number of workers, summed across all distributed training processes running on a machine, at 0 to 2 less than the number of logical CPU cores that machine has.
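
As a rough sketch of that rule of thumb (the margin of 2 and the even split across processes are illustrative, not a fixed recipe):

```python
import os

def workers_per_process(procs_per_node: int, margin: int = 2) -> int:
    # Logical cores on this machine (hyperthreads included).
    logical_cores = os.cpu_count() or 1
    # Keep the total worker count a bit below the core count...
    total_workers = max(logical_cores - margin, 0)
    # ...then split it across the distributed training processes on this node.
    return max(total_workers // procs_per_node, 0)

# e.g. single-node DDP with 4 training processes on a 32-core machine
# -> (32 - 2) // 4 = 7 workers per DataLoader
print(workers_per_process(procs_per_node=4))
```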

When debugging, ‘htop’ can help once your training process is running and in a steady state. If every single core is completely maxed out, and especially if there is a lot of red (kernel time), you might want to try backing off the worker count a bit to see if the throughput improves. It can also be worth checking the output of ‘i7z’ to make sure your CPU is running at its proper clock speeds and not being throttled. Sometimes the power state governor in your Linux install is overly conservative and keeps the frequencies down.
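
If you want a quick way to check whether more workers still buy you anything, something along these lines can be timed while watching htop. This is just a toy benchmark: the synthetic dataset, batch size, and worker counts are stand-ins for your real pipeline.

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class FakeImages(Dataset):
    """Stand-in dataset; __getitem__ does some per-sample CPU work."""
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        # Simulates decode + augmentation cost in each worker.
        return torch.randn(3, 224, 224)

if __name__ == "__main__":
    for num_workers in (0, 2, 4, 8):
        loader = DataLoader(FakeImages(), batch_size=64, num_workers=num_workers)
        start = time.time()
        for i, _batch in enumerate(loader):
            if i == 200:
                break
        print(f"num_workers={num_workers}: {time.time() - start:.1f}s for ~200 batches")
```

If throughput plateaus or drops while every core is pegged, you are past the useful worker count for that machine.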

The easiest way to cut back on CPU usage for a typical dataset/augmentation setup on image problems is to replace Pillow with Pillow-SIMD (it’s a pain to maintain the package dependencies, but usually worth it). Pillow is not the most efficient imaging library. Beyond that, switching to a CV2 image pipeline, using DALI, or perhaps trying something like Kornia could give you back some CPU cycles.
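
For instance, one way Kornia can give cycles back is by moving the random augmentations out of the CPU workers and onto the GPU, roughly like this (a minimal sketch assuming Kornia is installed; the specific transforms and parameters are just placeholders):

```python
import torch
import torch.nn as nn
import kornia.augmentation as K

device = "cuda" if torch.cuda.is_available() else "cpu"

# Random transforms run batched on the device instead of per-sample in workers.
gpu_augment = nn.Sequential(
    K.RandomHorizontalFlip(p=0.5),
    K.ColorJitter(brightness=0.2, contrast=0.2, p=0.8),
).to(device)

def augment_batch(images: torch.Tensor) -> torch.Tensor:
    # images: a (B, 3, H, W) float batch straight from the DataLoader,
    # which now only needs to decode and convert to tensors.
    images = images.to(device, non_blocking=True)
    return gpu_augment(images)
```

The trade-off is that augmentation now costs GPU time instead of CPU time, which is usually a win when the data pipeline, not the model, is the bottleneck.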
