Guidelines for assigning num_workers to DataLoader

Here is an example:
CPU: Intel® Core™ i7-8700 CPU @ 3.20GHz × 12
GPU: GeForce RTX 2080/PCIe/SSE2

The best performance came from pin_memory=True and num_workers=5.

pin_memory = True
print('pin_memory is', pin_memory)
for num_workers in range(0, 20, 1):  # sweep over the number of workers
    training_dataloader, len_training = dataloder_constructure(validationinputPath, validationdatasetPath,
                                                               num_workers=num_workers,
                                                               pin_memory=pin_memory)
    start = time.time()
    for epoch in range(1, 5):
        for i, data in enumerate(training_dataloader, 0):
            pass
    end = time.time()
    print("Finish with:{} second, num_workers={}".format(end - start, num_workers))

pin_memory is True
Finish with:5.5445404052734375 second, num_workers=0
Finish with:3.923471689224243 second, num_workers=1
Finish with:2.1549131870269775 second, num_workers=2
Finish with:1.5953092575073242 second, num_workers=3
Finish with:1.3868026733398438 second, num_workers=4
Finish with:1.2384390830993652 second, num_workers=5
Finish with:1.3161952495574951 second, num_workers=6
Finish with:1.3683111667633057 second, num_workers=7
Finish with:1.3703277111053467 second, num_workers=8
Finish with:1.4161624908447266 second, num_workers=9
Finish with:1.4753594398498535 second, num_workers=10
Finish with:1.4522082805633545 second, num_workers=11
Finish with:1.542546033859253 second, num_workers=12
Finish with:1.670602798461914 second, num_workers=13
Finish with:1.733013391494751 second, num_workers=14
Finish with:1.8041796684265137 second, num_workers=15
Finish with:1.8382213115692139 second, num_workers=16
Finish with:1.9090533256530762 second, num_workers=17
Finish with:1.9829919338226318 second, num_workers=18
Finish with:2.014136552810669 second, num_workers=19

pin_memory = False
print('pin_memory is', pin_memory)
for num_workers in range(0, 20, 1):  # sweep over the number of workers
    training_dataloader, len_training = dataloder_constructure(validationinputPath, validationdatasetPath,
                                                               num_workers=num_workers,
                                                               pin_memory=pin_memory)
    start = time.time()
    for epoch in range(1, 5):
        for i, data in enumerate(training_dataloader, 0):
            pass
    end = time.time()
    print("Finish with:{} second, num_workers={}".format(end - start, num_workers))

pin_memory is False
Finish with:3.4909846782684326 second, num_workers=0
Finish with:3.9767868518829346 second, num_workers=1
Finish with:2.2422804832458496 second, num_workers=2
Finish with:1.648954153060913 second, num_workers=3
Finish with:1.3978724479675293 second, num_workers=4
Finish with:1.3549144268035889 second, num_workers=5
Finish with:1.3050360679626465 second, num_workers=6
Finish with:1.7531778812408447 second, num_workers=7
Finish with:1.3492858409881592 second, num_workers=8
Finish with:1.509387493133545 second, num_workers=9
Finish with:1.4594593048095703 second, num_workers=10
Finish with:1.6034505367279053 second, num_workers=11
Finish with:1.5982444286346436 second, num_workers=12
Finish with:1.7085340023040771 second, num_workers=13
Finish with:1.738295555114746 second, num_workers=14
Finish with:1.8579375743865967 second, num_workers=15
Finish with:1.8640763759613037 second, num_workers=16
Finish with:1.9010047912597656 second, num_workers=17
Finish with:1.9985153675079346 second, num_workers=18
Finish with:2.2698612213134766 second, num_workers=19


But when I set the number of workers > 0, it doesn’t work, and I don’t know why.
Can you please tell me what the problem is?
Thanks

Hi Albert, what error are you getting?

Thank you for this recommendation. If I understand correctly, this formula “works” for data loaded onto the GPU. For loading data on the CPU, would you happen to have a corresponding formula too, e.g. k * num_CPU?
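
For a starting point, here is a minimal, hedged sketch of such a heuristic; the factor of 4 workers per GPU and the cpu_count()-based cap are common rules of thumb rather than anything confirmed in this thread, so treat the result only as a seed for the benchmark loop above.

import os
import torch

def suggest_num_workers(k: int = 4) -> int:
    """Heuristic starting point for num_workers (a rule of thumb, not a rule)."""
    num_cpus = os.cpu_count() or 1
    if torch.cuda.is_available():
        # k workers per GPU, but never more workers than CPU cores.
        return min(k * torch.cuda.device_count(), num_cpus)
    # CPU-only: leave one core free for the main process.
    return max(num_cpus - 1, 1)

print(suggest_num_workers())

On the 12-core, single-GPU machine from the first post this would suggest 4, close to the measured optimum of 5, but the only reliable answer is still to benchmark.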

I wish I had seen this thread earlier.
In my experience, num_workers has to be tuned based on various factors, including the number and size of the samples to be loaded, the speed of the CPU, GPU, and SSD, and of course the number of CPU cores.
Since it is difficult to get all of that hardware information programmatically, I suspect an automated optimization would be hard to implement.
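
One practical workaround is to tune empirically on the actual data rather than from hardware specs, in the spirit of the benchmark at the top of the thread. The sketch below is only an illustration: build_loader is a hypothetical factory that returns a DataLoader for a given num_workers, and the candidate range and epoch count are arbitrary assumptions.

import time

def find_best_num_workers(build_loader, candidates=range(0, 13, 2), epochs=2):
    """Time a few epochs per candidate and return the fastest num_workers."""
    timings = {}
    for num_workers in candidates:
        loader = build_loader(num_workers=num_workers)
        start = time.time()
        for _ in range(epochs):
            for _ in loader:  # iterate only, no training step
                pass
        timings[num_workers] = time.time() - start
        print(f"num_workers={num_workers}: {timings[num_workers]:.2f} s")
    return min(timings, key=timings.get)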


I have received some bad advice on this, and I would like to make the learning experience easier for others. Here is my process of identifying suitable values for CPU cores, num_workers, and batch_size. Please extend and critique as necessary.

My particular case involves data that sits in a numpy array in RAM (25M samples, 12 float features each). It does not fit on the GPU together with the model, hence I am using a data loader. I am running my experiments on a computational cluster, but effectively I am only reserving one GPU (Tesla P100) and at most 32 CPU cores that are on the same physical server. I took a “greedy hill-climbing” approach to this, i.e. I tried to optimize one parameter at a time, fiddled with it until I found an optimum, and then moved on to the next. I also revisited previous parameters once I thought I had found a reasonable optimum for all of them. All in all, I reduced the training time to around 25% of what it initially was.
I specifically trained a net for 10 epochs with each configuration, i.e. I did not train to convergence. Keep in mind that batch_size may also affect training time, since the weights of the net are updated separately for every batch.

CPU Cores:
As expected, the training time decreased with more CPU cores. However, the improvements seem to level off at 16 cores.

num_workers:
Increasing num_workers reduced training time, but I saw no improvement whatsoever after num_workers exceeded the number of CPU cores.
There is quite a lot of advice suggesting that num_workers should be twice the number of CPU cores, but that was not the case for me.

batch_size:
This is perhaps what surprised me most. It turned out that around 200k samples per batch gave the best performance for me. Remember that my samples are simply vectors of 12 floats. Decreasing the batch size to half of that increased the running time by 15%, while doubling it increased the running time by 35%. As I was initially off by an order of magnitude, I was able to cut the running time in half by adjusting the batch size alone.

One additional thing that could perhaps change training time significantly is the learning rate. Testing that would take very long as I would need the network to actually converge.
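
For context, here is a minimal sketch of the kind of setup described above, assuming a TensorDataset built from the in-memory numpy array. The 25M × 12 shape, the ~200k batch size, and the 16 workers come from the post; the random data and the rest are purely illustrative (the array alone takes roughly 1.2 GB of RAM, so shrink it for a quick test).

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real data: 25M samples with 12 float features each.
features = np.random.rand(25_000_000, 12).astype(np.float32)
targets = np.random.rand(25_000_000, 1).astype(np.float32)
dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(targets))

# Settings in the ballpark described above: ~200k samples per batch, 16 workers.
loader = DataLoader(dataset, batch_size=200_000, shuffle=True,
                    num_workers=16, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for x, y in loader:
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    # forward/backward pass and optimizer step would go here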


I recall Jeremy Howard saying that large batches are faster because the GPU has to set up its kernel before each batch (loading data, and possibly resetting other parameters?), and that setup takes time. For example, with 768 MNIST images and a batch size of 1, the GPU would have to set up the kernel 768 times for the computation, whereas with a batch size of 64 it would only do so 12 times.
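
A rough, hedged way to see this per-batch overhead yourself; the tiny linear model and the 768-sample toy tensor below are placeholders, and the measurement only illustrates launch/transfer overhead rather than Jeremy Howard's exact setup.

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(784, 10).to(device)
data = torch.randn(768, 784)

def time_batches(batch_size):
    """Run 768 samples through the model in chunks of batch_size and time it."""
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for chunk in data.split(batch_size):
            model(chunk.to(device, non_blocking=True))
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

print("batch_size=1 :", time_batches(1), "s")   # 768 separate launches
print("batch_size=64:", time_batches(64), "s")  # only 12 launches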

I think he means the number of GPUs you are using for training.

Hi @ptrblck,

When I set the num_workers to be greater than zero, I start to get these pymp files out of nowhere (and there are a lot of them).

I also get the following error message:

When I set num_workers=0, everything is fine.

I’m just wondering if you know what the problem is here.

Thank you so much!

The pymp folders should be temporary folders created by Python’s multiprocessing library and should be deleted once the process exits. If the script crashes and these folders are not deleted, you might want to delete them manually (and if that happens, it seems to be a bug in Python).

.nfs files are placeholders created on your NFS server. It seems that you might be reading or writing data on this NFS server? If you are reading from it, note that this might be a potential bottleneck in your training routine.
This issue might be related to your error.
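
If leftover pymp-* folders pile up after crashes, a small cleanup sketch like the one below can remove them; the system temp directory is the usual location for these folders, but that is an assumption here, so double-check the paths and make sure no job is still running before deleting anything.

import glob
import os
import shutil
import tempfile

# Leftover multiprocessing temp dirs are usually named pymp-<random>
# and live under the system temp directory.
pattern = os.path.join(tempfile.gettempdir(), "pymp-*")
for path in glob.glob(pattern):
    shutil.rmtree(path, ignore_errors=True)
    print("removed", path)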


Hi @ptrblck,

Thank you for your reply.

It seems that my model is still training even though I got these error messages. Does that mean I can ignore these errors and just let the model train until it finishes? Is the final model trained like this still reliable?

I can’t comment on the reliability of your model and would try to avoid these errors.
Are you using a network drive, and if so, would it be possible to copy the data onto a local drive?

Hi @ptrblck,

Thank you so much for your reply!
I read here that someone had a somewhat related problem, and the solution was to increase the open files limit. Do you think this might also work for my problem?

It might solve your issue and it is definitely worth a try.
However, if I’m correct in the assumption that you are using a network drive, I would still recommend copying the data onto a local SSD (if possible, of course).
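
For reference, a hedged sketch of the two things usually tried for the open files limit: raising the process’s RLIMIT_NOFILE soft limit and switching PyTorch’s tensor-sharing strategy to file_system. Whether either one fixes this particular error is not something this thread confirms.

import resource
import torch.multiprocessing as mp

# Raise the soft limit for open file descriptors up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Alternatively, share tensors between workers via the file system,
# which avoids exhausting file descriptors (the default strategy on Linux
# keeps a descriptor open per shared tensor).
mp.set_sharing_strategy("file_system")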

In this mode, data fetching is done in the same process a DataLoader is initialized. Therefore, data loading may block computing. However, this mode may be preferred when resource(s) used for sharing data among processes (e.g., shared memory, file descriptors) is limited, or when the entire dataset is small and can be loaded entirely in memory. Additionally, single-process loading often shows more readable error traces and thus is useful for debugging.
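
In code, that single-process mode is simply the default num_workers=0; a minimal sketch with a placeholder dataset, useful when you want readable tracebacks from the dataset’s __getitem__ during debugging:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# num_workers=0 loads data in the main process: easier to debug,
# readable stack traces, no shared memory or extra file descriptors.
debug_loader = DataLoader(dataset, batch_size=10, num_workers=0)

for batch in debug_loader:
    pass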

Hi, @Qinru_Li. Sorry, I missed the notification. The error that I got was a “Killed signal”. I then updated my PyTorch version to 1.3.1 and I haven’t had any problems since.
Thank you very much for the response. I will give you an update if I find the same error in the latest version of PyTorch.

Best regards,
Albert Christianto


To speed up the experiment, I think you should set the batch size as large as it can go, i.e. just below the point where increasing it any further would make CUDA run out of memory. Then you should leave num_workers at 0.

A larger batch size is supposed to reduce the training time (as more data is processed in each iteration).
However, after doing that, I found that the data loading time nearly doubled when I doubled the batch size.
How can that be? Data loading becomes the bottleneck that holds everything up.

Double post from here with a follow-up.

Hi, @ptrblck, thanks for your answers.

I am currently working on a server that has to load data from a network drive (it is designed that way). I get a Device or resource busy error, like in your link, when num_workers is greater than 1.

Is there any way to prevent this error? Thanks.