DataLoader creates a new PROCESS for every worker, and the dataset has to be "copied" to every worker. Depending on whether you're on Windows or Linux (spawning a process on Windows is much more expensive than forking one on Linux), and on how the dataset stores its data (from what I've tested, Tensors don't seem to get copied, but Python structures do), you might have a very high overhead for creating the processes.
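Here's a minimal sketch of the storage difference (class names are just illustrative; the idea is that on Linux, fork shares one big tensor copy-on-write, while a list of Python objects tends to get effectively copied into each worker):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TensorBacked(Dataset):
    def __init__(self, n=100_000):
        # One contiguous tensor; forked workers can share its pages
        self.data = torch.randn(n, 128)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

class ListBacked(Dataset):
    def __init__(self, n=100_000):
        # Many small Python objects; refcount updates touch their pages,
        # so each worker ends up with its own copy
        self.data = [torch.randn(128) for _ in range(n)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

if __name__ == "__main__":  # guard required on Windows, where workers are spawned
    loader = DataLoader(TensorBacked(), batch_size=64, num_workers=4)
    for batch in loader:
        pass
```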
Unless you’re working with some supercomputer, I believe 8 workers is more than enough.
Also, most importantly, check your RAM usage: if your OS starts swapping memory to disk, everything can get extremely slow.
Reduce the number of workers, and check for improvements.
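If you want to actually measure it, something like this rough sketch (the `TensorDataset` is just a stand-in for your own dataset) lets you compare timings for a few worker counts while you watch RAM in htop / Task Manager:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":  # guard needed when workers are spawned (Windows)
    dataset = TensorDataset(torch.randn(50_000, 128))  # stand-in for your data
    for workers in (0, 2, 4, 8):
        loader = DataLoader(dataset, batch_size=64, num_workers=workers)
        start = time.time()
        for _ in loader:
            pass  # iterate only, isolating data-loading cost from the model
        print(f"num_workers={workers}: {time.time() - start:.2f}s")
```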
Your problem might be that there simply isn't much for the GPU to do (if your model is very small, or your batch size is very small, for example), not necessarily that it's waiting for data.
An easy way to check is to look for "pits" in GPU usage: if there are times when GPU usage suddenly drops, it's probably waiting for data (although you probably won't be able to see this now, as your "max" appears to be 1%).
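A rough way to log this from Python, run alongside your training script (assuming `torch.cuda.utilization()` is available, which needs the `pynvml` package; `watch -n 1 nvidia-smi` from a shell gives you the same number without any code):

```python
import time
import torch

for _ in range(120):  # sample roughly once a second for ~2 minutes
    print(f"GPU utilization: {torch.cuda.utilization()}%")
    time.sleep(1.0)
```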