Training faster with a single GPU

Is there a way to train faster on a single GPU? I have a 16-core CPU, and it seems like increasing the number of workers on the DataLoader slows down training. Can I get any hints in this regard? I'm somewhat new to this domain. Thank you.

This post gives a great overview of how to handle data loading bottlenecks.
In particular, this section might be interesting for you:

Beyond an optimal number (experiment!), throwing more worker processes at the IOPS barrier WILL NOT HELP, it’ll make it worse. You’ll have more processes trying to read files at the same time, and you’ll be increasing the shared memory consumption by significant amounts for additional queuing, thus increasing the paging load on the system and possibly taking you into thrashing territory that the system may never recover from.
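
To act on the "experiment!" advice, one option is to time a fixed number of batches for a few worker counts and keep the fastest. The snippet below is only a rough sketch: `benchmark_workers`, `best_worker_count`, and `make_loader` are hypothetical names made up for illustration, and the batch size and worker counts in the commented usage are assumptions you would adapt to your own setup.

```python
import time


def benchmark_workers(make_loader, worker_counts, max_batches=50):
    """Time how long it takes to pull `max_batches` batches per worker count.

    `make_loader` is a hypothetical callable that takes a worker count and
    returns an iterable of batches (for example a torch DataLoader).
    """
    timings = {}
    for n in worker_counts:
        loader = make_loader(n)
        start = time.perf_counter()
        # Iterating also includes worker start-up cost, which is part of
        # what you pay at the beginning of every epoch.
        for i, _batch in enumerate(loader):
            if i + 1 >= max_batches:
                break
        timings[n] = time.perf_counter() - start
    return timings


def best_worker_count(timings):
    """Return the worker count with the smallest measured time."""
    return min(timings, key=timings.get)


# Usage with PyTorch, assuming `dataset` is your own torch.utils.data.Dataset:
#
#   from torch.utils.data import DataLoader
#
#   timings = benchmark_workers(
#       lambda n: DataLoader(dataset, batch_size=64, num_workers=n),
#       worker_counts=[0, 2, 4, 8, 16],
#   )
#   print(timings, best_worker_count(timings))
```

This only measures data loading, not the full training step, so it tells you whether the loader is the bottleneck and where adding workers stops paying off. Repeating the run a couple of times helps, since the first pass may be slowed down or sped up by disk caching.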