Guidelines for assigning num_workers to DataLoader

I’m not familiar enough with NFS and don’t know why these temporary files are raising this error using multiple processes.

Thanks for the reply.

I am not sure if this error will affect the result since it is still training. Now I am trying different versions of python/pytorch and looking for other ways.

Assuming you are training your model on a cluster: I recently found a way to use the resources efficiently. Given a batch_size of 32 and a machine with 8 CPUs (num_cpus), I set num_workers = batch_size / num_cpus, which gives 4 workers.
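That rule of thumb can be sketched in a few lines of plain Python; `suggested_num_workers` is a name made up for illustration, not an existing API:

```python
import os

def suggested_num_workers(batch_size, num_cpus=None):
    # Heuristic from this thread: num_workers = batch_size / num_cpus.
    # Falls back to os.cpu_count() when the CPU count isn't given.
    if num_cpus is None:
        num_cpus = os.cpu_count() or 1
    return max(1, batch_size // num_cpus)

print(suggested_num_workers(32, num_cpus=8))  # batch_size 32 on 8 CPUs -> 4
```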



Hi All,

I encountered similar issue, but my issue is much more complicated.

I did the following experiments:

|              | Machine 1                      | Machine 2                       |
|--------------|--------------------------------|---------------------------------|
| num_workers  | 8                              | 8                               |
| CPU          | i9-9900K (8 cores, 16 threads) | i7-10700K (8 cores, 16 threads) |
| GPU          | 2080Ti (11 GB)                 | 3090 (24 GB)                    |
| CPU RAM used | 9.5 GB                         | unknown                         |
| GPU RAM used | 7.18 GB                        | unknown                         |
| epochs       | 300                            | 300                             |
| dataset      | COCO                           | COCO                            |
| image count  | 122218                         | 122218                          |
| batch size   | 8                              | 8                               |
| code         | yolov3 (the same)              | yolov3 (the same)               |

On the 2080Ti computer, the code trains correctly.
On the 3090 computer, once the training and validation data are loaded, the code doesn’t go any further. I clicked the “Pause Program” button in PyCharm and found that the code stops in the following code:

    def _exhaustive_wait(handles, timeout):
        # Return ALL handles which are currently signalled.  (Only
        # returning the first signalled might create starvation issues.)
        L = list(handles)
        ready = []
        while L:
            res = _winapi.WaitForMultipleObjects(L, False, timeout)
            if res == WAIT_TIMEOUT:
                break
            elif WAIT_OBJECT_0 <= res < WAIT_OBJECT_0 + len(L):
                res -= WAIT_OBJECT_0
            elif WAIT_ABANDONED_0 <= res < WAIT_ABANDONED_0 + len(L):
                res -= WAIT_ABANDONED_0
            else:
                raise RuntimeError('Should not get here')
            ready.append(L[res])
            L = L[res+1:]
            timeout = 0
        return ready

It seems that the code is waiting for a thread?


I implemented an algorithm to find the optimal num_workers for fast training.

You can simply find the optimal num_workers on any system with this algorithm.

The code below is an example:

    import torch
    import nws

    batch_size = ...
    dataset = ...

    num_workers = ...

    loader = ...
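The snippet above is garbled and the `nws` package’s API isn’t shown, but the general idea can be sketched without it: time a few batches for each candidate num_workers value and keep the fastest. The `make_loader` factory and candidate list below are hypothetical names, not part of the original code:

```python
import time

def best_num_workers(make_loader, candidates, batches=50):
    # Time iteration over a fixed number of batches for each candidate
    # num_workers value, then return the fastest setting plus all timings.
    timings = {}
    for n in candidates:
        it = iter(make_loader(n))
        start = time.perf_counter()
        for _ in range(batches):
            try:
                next(it)
            except StopIteration:
                break
        timings[n] = time.perf_counter() - start
    return min(timings, key=timings.get), timings
```

With PyTorch, `make_loader` would be something like `lambda n: torch.utils.data.DataLoader(dataset, batch_size=batch_size, num_workers=n)`.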

Worse, it is not working.

Could you explain more details about the problem?

I have two datasets, NRRD and TIFF files. I loaded the data with PyTorch loaders and set num_workers=4, but the DataLoader raised an error. I also used your algorithm, and it shows the same error.

After some practical checks, the following worked smoothly for me:

The num_workers attribute should be set to 4 * num_GPU; 8 or 16 should generally be good:

  1. num_workers=1, up to 50% GPU, for 1 epoch (106s over 1/10 epochs),
    training completes in 43m 24s

  2. num_workers=32, up to 90% GPU, for 1 epoch (42s over 1/14 epochs),
    training completes in 11m 9s

  3. num_workers=8 or 16, up to 90% GPU (8 is slightly better), for 1 epoch (40.5s over 1/16 epochs),
    training completes in 10m 52s

  4. num_workers=0, up to 40% GPU, for 1 epoch (129s over 1/16 epochs),
    training completes in 34m 32s
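The 4 * num_GPU rule of thumb from the post above can be written out as a small helper; capping at the logical CPU count is an added assumption of mine, and in PyTorch you would obtain `num_gpus` from `torch.cuda.device_count()`:

```python
import os

def heuristic_num_workers(num_gpus):
    # Rule of thumb from this thread: 4 workers per GPU,
    # but never more workers than logical CPUs (my own cap).
    return min(4 * max(num_gpus, 1), os.cpu_count() or 1)
```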


Were you able to find a solution? I also have a 3090, and num_workers > 1 causes the dataloader to lock up.

To solve this issue:

On Windows 10, a lot of virtual memory is needed, so you must make sure your OS partition is large enough.

On Linux, a big swap partition is necessary.


Well, this appears to me to be closely related to why DataLoaders exist in the first place. Let me elaborate:
Since you set num_workers = 1, there’s only one parallel process generating the data, which might have caused your GPU to sit idle while waiting for that process to make the data available. And because of that, the GPU (CUDA) would have run out of memory.
As you increase the number of workers (processes), more processes run in parallel to fetch data in batches, essentially using the CPU’s multiple cores to generate the data: data is available more readily than before, the GPU doesn’t have to sit idle waiting for batches to load, and there are no CUDA out-of-memory errors.
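The effect described above can be illustrated with a minimal sketch in plain Python (no PyTorch required); `load_batch` is a hypothetical stand-in for reading and decoding one batch, and threads emulate the DataLoader’s worker processes:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_batch(i):
    # Hypothetical stand-in for reading + decoding one batch from disk.
    time.sleep(0.01)
    return list(range(i * 4, (i + 1) * 4))

def fetch_all(num_workers, num_batches=20):
    # num_workers == 0: the consumer loads every batch itself (GPU idles).
    # num_workers  > 0: workers fetch batches in parallel, like DataLoader.
    start = time.perf_counter()
    if num_workers == 0:
        batches = [load_batch(i) for i in range(num_batches)]
    else:
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            batches = list(pool.map(load_batch, range(num_batches)))
    return batches, time.perf_counter() - start
```

Running `fetch_all(0)` and `fetch_all(4)` returns identical batches, but the parallel version finishes noticeably faster, which is the effect the post describes.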

[Any help will be greatly appreciated]
I also want to know how the num_workers setting works.
In my case, I have a dataset of videos (sequences of images), and my batch size is 1 (GPU-limited) but each item includes many images, so I want to know how to set num_workers:
set 0: it seems to run at 5 s/item when training;
set 8: it seems to run at 1.5 s/item when training, but sometimes it blocks/suspends for several minutes every few hours; only one CPU core is at 100%, the other CPUs are not working, the GPU is totally idle, RAM still has lots of free space, and I/O reads/writes stay low or at 0. It’s weird.

It seems that the first epoch works fine, but from the second epoch on, the idle intervals happen.

set 16: it seems to be slower than setting 8, and RAM fills up.
I usually set 8 to run, but the GPU is not used to its full advantage; it works for a while and sits idle for a while.

I’m also getting NaNs in my loss.

So far, I can train on Windows with a single GPU and num_workers > 0 with no issue.

I will get NaN/Inf after 30-50 epochs when:

  • Running multiple GPU, num_workers = 0 on Linux
  • Running multiple GPU, num_workers > 0 on Linux
  • Single GPU, num_workers > 0 on Linux

I don’t get NaN/Inf when running a single GPU with num_workers = 0 on Linux, which makes me believe it’s something about multiprocessing.

Has anyone seen this issue?