Slow training, looking for errors / code improvements

Hi there!
I am currently trying to train a cGAN model from scratch. It works fine, but training seems very slow for the hardware I am using (an RTX 4090). To learn the most from this issue, I would like to spot any errors in my code first. I am not sure I can just copy-paste the whole model (~200 lines of code), so maybe someone has run into a similar problem, or is experienced enough to tell me what could be wrong.

For example, currently I am using

    dataloader = torch.utils.data.DataLoader(
        dataset,
        shuffle=True,
        batch_size=124,
        drop_last=True,
    )

Adding num_workers=4 makes training roughly 5 times faster, but I am not sure whether this is a good change to make. However, since small changes like that have such a high impact, I wonder what else there is to optimize.
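For completeness, the adjusted loader looks like this (pin_memory and persistent_workers are further options I have seen suggested for CUDA setups, but have not benchmarked myself):

    dataloader = torch.utils.data.DataLoader(
        dataset,
        shuffle=True,
        batch_size=124,
        drop_last=True,
        num_workers=4,            # load batches in background processes
        pin_memory=True,          # may speed up host-to-GPU transfers
        persistent_workers=True,  # keep workers alive between epochs
    )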

Using multiprocessing to load and process the data is a good approach, as it avoids using the main process for this task (which is also responsible for launching the CUDA kernels, etc.).
I’m not sure what concern you have regarding this change.
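If you want to confirm that data loading was indeed the bottleneck, a crude but effective check is to time the batch fetch and the training step separately. A minimal sketch, assuming a CUDA device and using a random TensorDataset as a stand-in for your real data:

    import time
    import torch

    # Stand-in dataset; replace with your real one.
    dataset = torch.utils.data.TensorDataset(torch.randn(10_000, 3, 64, 64))
    loader = torch.utils.data.DataLoader(dataset, batch_size=124, num_workers=4)

    fetch_time = step_time = 0.0
    end = time.perf_counter()
    for (batch,) in loader:
        t0 = time.perf_counter()
        fetch_time += t0 - end                 # time spent waiting on the loader
        batch = batch.cuda(non_blocking=True)
        # ... your forward/backward/optimizer step would go here ...
        torch.cuda.synchronize()               # flush pending GPU work for fair timing
        end = time.perf_counter()
        step_time += end - t0
    print(f"data loading: {fetch_time:.2f}s, compute: {step_time:.2f}s")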

I would recommend checking our Performance Tuning Guide for general tips.

Thanks for your help! I wonder, how does one determine the best value for num_workers? Is there something like a rule of thumb, or is it simply trial and error?

I think generally users don’t increase the number of workers beyond their CPU core count (and I believe higher-level APIs such as Lightning also raise a warning or set num_workers to your CPU count). You could run a few experiments, since other users have reported diminishing returns for num_workers>4, but as usual it all depends on your system and workload.
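If you go the empirical route, a simple benchmark along these lines (again with a placeholder dataset) lets you compare candidate values directly:

    import time
    import torch

    # Placeholder dataset; swap in your real one for meaningful numbers.
    dataset = torch.utils.data.TensorDataset(torch.randn(10_000, 3, 64, 64))

    for num_workers in (0, 2, 4, 8):
        loader = torch.utils.data.DataLoader(
            dataset, batch_size=124, shuffle=True, num_workers=num_workers
        )
        start = time.perf_counter()
        for _ in loader:  # iterate without training to isolate loading cost
            pass
        print(f"num_workers={num_workers}: {time.perf_counter() - start:.2f}s")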