Guidelines for assigning num_workers to DataLoader

In this mode, data fetching is done in the same process in which the DataLoader is initialized. Therefore, data loading may block computation. However, this mode may be preferred when the resources used for sharing data among processes (e.g., shared memory, file descriptors) are limited, or when the entire dataset is small and can be loaded entirely into memory. Additionally, single-process loading often shows more readable error traces and is thus useful for debugging.
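
As a quick illustration of the two modes (a toy dataset and placeholder sizes, not anything specific to this thread):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; substitute your own Dataset implementation.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

# num_workers=0: batches are produced in the main process.
# Easier to debug, but loading can block the training computation.
debug_loader = DataLoader(dataset, batch_size=32, num_workers=0)

# num_workers>0: batches are produced by background worker processes,
# which consumes extra shared memory and file descriptors.
parallel_loader = DataLoader(dataset, batch_size=32, num_workers=4)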

Hi, @Qinru_Li. Sorry, I missed the notification. The error that I get is a “Killed” signal. I then updated my PyTorch version to 1.3.1 and haven't run into any problem since.
Thank you very much for the response. I will give you an update if I find the same error in the latest version of PyTorch.

Best regards,
Albert Christianto


To speed up the experiment, I think you should set the batch size as large as possible (any larger and CUDA will run out of memory), and then leave num_workers at 0.

A larger batch size is supposed to reduce the training time (as more data is processed in each iteration).
However, after I do that, I find that the data loading time nearly doubles when I double the batch size.
How can that be? Data loading becomes the bottleneck that delays everything.

Double post from here with a follow-up.

Hi, @ptrblck, thanks for your answers.

I am currently working on a server that has to load data from a network drive (it is designed that way). I get the Device or resource busy error, like in your link, when num_workers is greater than 1.

Is there any way to prevent this error? Thanks.

I’m not familiar enough with NFS and don't know why these temporary files raise this error when multiple processes are used.

Thanks for the reply.

I am not sure whether this error affects the results, since training still runs. I am now trying different versions of Python/PyTorch and looking for other workarounds.

Assuming you are training your model on a cluster and want to use the resources efficiently: given a batch_size of 32 and a machine with 8 CPUs (i.e. num_cpus), I set num_workers = batch_size / num_cpus, which gives 4 workers. A rough sketch of this heuristic is below.
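
A minimal sketch of that heuristic, using os.cpu_count() as a stand-in for the number of CPUs actually allocated to the job (a cluster scheduler may give you fewer):

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 32
num_cpus = os.cpu_count() or 1   # on a cluster, prefer the CPU count allocated to your job

# Heuristic from the post above: num_workers = batch_size / num_cpus (32 / 8 = 4).
num_workers = max(1, batch_size // num_cpus)

# Toy dataset just to make the example self-contained.
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)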


@all

Hi All,

I encountered a similar issue, but mine is more complicated.

I did the following experiments:

|              | 2080 Ti machine                | 3090 machine                    |
|--------------|--------------------------------|---------------------------------|
| num_workers  | 8                              | 8                               |
| CPU          | i9-9900K (8 cores, 16 threads) | i7-10700K (8 cores, 16 threads) |
| GPU          | 2080 Ti (11 GB)                | 3090 (24 GB)                    |
| CPU RAM used | 9.5 GB                         | unknown                         |
| GPU RAM used | 7.18 GB                        | unknown                         |
| epochs       | 300                            | 300                             |
| dataset      | COCO                           | COCO                            |
| image count  | 122218                         | 122218                          |
| batch size   | 8                              | 8                               |
| code         | yolov3 (the same)              | yolov3 (the same)               |

On the 2080 Ti computer, the code trains correctly.
On the 3090 computer, once the training and validation data are loaded, the code doesn't move forward. I clicked the “Pause Program” button in PyCharm and found that the code stops in the following code:

C:\ProgramData\Anaconda3\Lib\multiprocessing\connection.py:
    def _exhaustive_wait(handles, timeout):
        # Return ALL handles which are currently signalled.  (Only
        # returning the first signalled might create starvation issues.)
        L = list(handles)
        ready = []
        while L:
            res = _winapi.WaitForMultipleObjects(L, False, timeout)
            if res == WAIT_TIMEOUT:
                break
            elif WAIT_OBJECT_0 <= res < WAIT_OBJECT_0 + len(L):
                res -= WAIT_OBJECT_0
            elif WAIT_ABANDONED_0 <= res < WAIT_ABANDONED_0 + len(L):
                res -= WAIT_ABANDONED_0
            else:
                raise RuntimeError('Should not get here')
            ready.append(L[res])
            L = L[res+1:]
            timeout = 0
        return ready

It seems that the code is waiting for a thread?


I implemented an algorithm to find the optimal num_workers for fast training.

You can simply find the optimal num_workers on any system with this algorithm.

The code below is an example.

import torch
import nws

batch_size = ...  # your batch size
dataset = ...     # your torch.utils.data.Dataset

# Search for the num_workers value that loads batches fastest on this system.
num_workers = nws.search(dataset=dataset,
                         batch_size=batch_size)  # plus any other search options

loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=batch_size,
                                     num_workers=num_workers)  # plus any other DataLoader options

Worse, it is not working.

Could you explain the problem in more detail?

I have two datasets, NRRD and TIFF files, and I loaded the data with PyTorch loaders, but the num_workers I gave was 4.
I got a DataLoader “exited” error, so I used your algorithm, and it shows the same error.

For me, after some practical checks, the following worked smoothly:

The num_workers argument of torch.utils.data.DataLoader should be set to about 4 * num_GPU; 8 or 16 should generally be good (a rough benchmarking sketch follows the timings below):

  1. num_workers=1, up to 50% GPU, for 1 epoch (106s over 1/10 epochs),
    training completes in 43m 24s

  2. num_workers=32, up to 90% GPU, for 1 epoch (42s over 1/14 epochs),
    training completes in 11m 9s

  3. num_workers=8, 16, up to 90% GPU, (8 is slightly better) for 1 epoch (40.5s over 1/16 epochs),
    training completes in 10m 52s

  4. num_workers=0, up to 40% GPU, for 1 epoch (129s over 1/16 epochs),
    training completes in 34m 32s
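
Timings like these are easy to reproduce on your own data. Here is a rough sketch (toy random dataset; the batch size and worker counts are only placeholders) that just iterates the DataLoader and times it for several num_workers values:

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":      # required on Windows when num_workers > 0
    # Toy dataset (~100 MB of random images); substitute your own Dataset.
    dataset = TensorDataset(torch.randn(2048, 3, 64, 64),
                            torch.randint(0, 10, (2048,)))

    for num_workers in (0, 2, 4, 8, 16):
        loader = DataLoader(dataset, batch_size=32,
                            num_workers=num_workers,
                            pin_memory=torch.cuda.is_available())
        start = time.perf_counter()
        for _ in range(2):          # two passes to amortize worker startup cost
            for _batch in loader:
                pass                # measure loading only; no model involved
        print(f"num_workers={num_workers}: {time.perf_counter() - start:.1f}s")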


Were you able to find the solution? I also have a 3090, and num_workers > 1 causes the DataLoader to lock up.

To solve this issue:

On Windows 10, a lot of virtual memory is needed, so you must make sure your OS partition is large enough.

On Linux, a big swap partition is necessary.
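
If it helps, here is a small diagnostic sketch using psutil (assuming it is installed; it is not part of PyTorch) to check free RAM and swap before raising num_workers:

import psutil

ram = psutil.virtual_memory()
swap = psutil.swap_memory()

print(f"RAM:  {ram.available / 1e9:.1f} GB available of {ram.total / 1e9:.1f} GB")
print(f"Swap: {swap.free / 1e9:.1f} GB free of {swap.total / 1e9:.1f} GB")

# Each DataLoader worker is a separate process holding its own copy of the
# dataset object, so very little free swap/virtual memory is a sign to lower num_workers.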


Well, this appears to me to be closely related to why DataLoaders exist in the first place. Let me elaborate:
Since you set num_workers = 1, there is only one parallel process generating the data, which might have caused your GPU to sit idle while waiting for that process to make the data available, and because of that the GPU (CUDA) ran out of memory.
As you increase the number of workers (processes), more processes run in parallel to fetch batches of data, essentially using the CPU's multiple cores to generate the data = data is available more readily than before = the GPU doesn't have to sit idle waiting for batches to load = no CUDA out-of-memory errors.
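
To make that concrete, a minimal sketch of a loader configured so the GPU is kept fed (toy dataset and placeholder values; on Windows, wrap this in an if __name__ == "__main__": guard when num_workers > 0):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; substitute your own.
dataset = TensorDataset(torch.randn(2048, 3, 64, 64),
                        torch.randint(0, 10, (2048,)))

loader = DataLoader(dataset,
                    batch_size=32,
                    num_workers=4,      # workers prepare upcoming batches in parallel
                    pin_memory=True)    # page-locked memory allows async copies to the GPU

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for images, targets in loader:
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # forward/backward pass would go here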

[Any help will be greatly appreciated]
I also want to know how the num_workers setting works.
In my case, I have a dataset of videos (sequences of images), and my batch size is 1 (GPU-limited), but each item includes many images, so I want to know how to set num_workers:
set 0: it runs at about 5 s/item during training;
set 8: it runs at about 1.5 s/item during training, but sometimes it blocks/suspends for several minutes every few hours; only one CPU core is at 100%, the other CPUs are not working, the GPU is completely idle, RAM still has lots of free space, and disk I/O stays low or at 0. It's weird.

It seems that the first epoch works fine, but from the second epoch onward the idle intervals start to happen.

set 16: it seems to be slower than setting 8, and RAM fills up.
I usually set 8 to run, but the GPU is not being used to its full advantage: it always works for a while and then sits idle for a while.

I’m also getting NaNs in my loss.

So far, I can train on Windows with a single GPU and num_workers > 0 with no issue.

I get NaN/Inf after 30-50 epochs when:

  • Running multiple GPUs, num_workers = 0 on Linux
  • Running multiple GPUs, num_workers > 0 on Linux
  • Single GPU, num_workers > 0 on Linux

I don’t get NaN/Inf when running on a single GPU with num_workers = 0 on Linux, which makes me believe it's something about multiprocessing.

Has anyone seen this issue?