Guidelines for assigning num_workers to DataLoader

Is your data stored on an SSD or HDD?
I would assume you have an IO bottleneck.
Is your complete training script faster if you use num_workers=0 (e.g. one complete epoch)?


I stored the images in .pt format; I do not know the format very well…

Actually, I tried num_workers=0, and it loads each image faster than num_workers=8. However, each time the CPU loads only a few batches, e.g. 2 batches to train, and then it waits several minutes to load another two batches, and so on. With num_workers=8, it loads 12 batches at a time, but each image takes longer to load; I guess that is because of the CPU sub-processes. After the forward and backward passes, it waits several minutes to load the next 12 batches of data.

The numbers of batches here are just examples, but that is generally my situation. I found that setting num_workers properly really depends on the number of CPUs, the number of GPUs and the batch size on the workstation…
If I understand it correctly: if I have 8 CPU cores and use 1 GPU for my NN, and I set the batch_size to 8, then basically each CPU core will take care of one image, right? And depending on the GPU memory available for the data, it will transfer the batches onto the GPU to train, and only once the training step on the GPU ends will it gather new batches of data?

Finally, I set num_workers=8 and batch_size=4 to train my auto-encoder. The speed is acceptable.
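
For reference, lazily loading per-image .pt files is roughly what each DataLoader worker ends up doing here; a minimal sketch, assuming every image tensor was saved individually with torch.save (the file paths, labels and class name are made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class PtImageDataset(Dataset):
    """Loads one pre-saved .pt tensor per sample, on demand."""

    def __init__(self, file_paths, labels):
        self.file_paths = file_paths  # e.g. ['img_0.pt', 'img_1.pt', ...]
        self.labels = labels

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Every DataLoader worker executes this call, so disk speed
        # (SSD vs. HDD) directly limits how fast batches are produced.
        image = torch.load(self.file_paths[idx])
        return image, self.labels[idx]

# loader = DataLoader(PtImageDataset(paths, labels), batch_size=4, num_workers=8)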

I also have a similar problem.

I am trying to implement the distributed training from the PyTorch examples with 4 GPUs (one sub-process for each GPU), but when I set num_workers>0 for each sub-process the training just becomes extremely slow, and I have no idea why. However, num_workers=0 works pretty nicely.

Could the reason be that spawning sub-processes inside a sub-process causes problems for the CPU?
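
For context, a minimal sketch of what each per-GPU sub-process typically does in the distributed examples, assuming one process per GPU with a DistributedSampler (the backend, batch size and dataset here are placeholders). With num_workers>0, every GPU process additionally spawns its own loader processes:

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def run(rank, world_size, dataset):
    # Requires MASTER_ADDR / MASTER_PORT to be set in the environment.
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    # Each of the world_size GPU processes builds its own loader, so
    # num_workers=2 here means world_size * 2 extra loader processes overall.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=2, pin_memory=True)
    for data, target in loader:
        data, target = data.cuda(rank), target.cuda(rank)
        # forward / backward / optimizer step ...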


@YossiB You can pre-load the images in memory and keep them there to avoid loading them from disk every time (I/O is time consuming).
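
A minimal sketch of that idea, assuming the whole dataset fits in RAM: everything is read from disk once in __init__, so __getitem__ only touches memory (the names are made up for illustration):

import torch
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    """Reads every sample from disk once and serves it from RAM afterwards."""

    def __init__(self, file_paths, labels):
        # One-time disk cost; no per-batch I/O afterwards.
        self.samples = [torch.load(p) for p in file_paths]
        self.labels = labels

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Pure memory access, no disk I/O per sample.
        return self.samples[idx], self.labels[idx]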


I am trapped by the same issue. I have 700k images, but my PC only has 64 GB of memory, so it can't load all the images. I use the DataLoader to load the images, but it's very slow. When I reduce the number of images, e.g. to just 10k for training, the speed is normal. So I want to store my data on an SSD. Will that work?

What if my memory is not enough? Will an SSD work?

I think SSDs will speed up disk I/O, but that won't solve your problem entirely, since you still need to access the disk at some point.

I never came across this issue, but here are some pointers:

  1. Use one dataset and one dataloader over the entire data. Load samples only when needed (i.e., the minibatch). After processing a minibatch, delete it. This runs the risk that concurrency between the worker processes loads minibatches faster than they can be consumed and saturates the RAM. Moreover, you need to access the disk for every sample every time.
  2. Randomly split the entire data into BIG chunks (k chunks) at every epoch. Then load a chunk into memory using torch.utils.data.Dataset and torch.utils.data.DataLoader, process this chunk like any dataset, and then delete the dataset and the dataloader. The advantage is that you can control the number of samples loaded into memory to avoid overloading it. Something like this (where load_chunk is a placeholder for however you read chunk i from disk):
import torch

size_chunk = 10000
nbr_chunks = 70

for i in range(nbr_chunks):
    # Build a dataset that contains only the current chunk.
    # This requires disk access, and an SSD can boost the speed.
    # However, this remains an issue since you will need to reload EVERY SAMPLE EVERY TIME.
    # `load_chunk` is a stand-in for whatever reads chunk i from disk as tensors;
    # TensorDataset is used because the base Dataset class cannot be
    # instantiated directly with data.
    data_i, labels_i = load_chunk(i)
    dataset_i = torch.utils.data.TensorDataset(data_i, labels_i)
    # Create a dataloader that iterates over this chunk only and splits it into minibatches.
    dataloader_chunk_i = torch.utils.data.DataLoader(dataset_i, batch_size=64, shuffle=True)
    # Do your training over the current minibatches:
    for j, (data, label) in enumerate(dataloader_chunk_i):
        # Process this minibatch: forward data, compute loss, update params, and all that.
        pass
    # Now you are done with this chunk. Delete it to free the memory.
    del dataset_i
    del dataloader_chunk_i
  3. As you can see, option 2 is problematic due to the limited space in memory (RAM). Another way, which avoids the annoying chunking above while keeping the standard data loading, is to find a workaround to load the entire dataset into memory: compress the samples on disk, then load all the samples into memory and keep them compressed. Decompress a sample ONLY when NEEDED. Once you have finished working on that sample, delete the uncompressed version and keep only the compressed one. At any time, only a minibatch worth of samples is decompressed, while everything else stays compressed to preserve memory. This can be a solution if reading from disk is slower than decompressing a compressed sample (a minimal sketch follows right after this list).
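
A minimal sketch of option 3, assuming the samples are stored as already-compressed JPEG/PNG files whose raw bytes fit into RAM, and that PIL is used to decode them on demand (the class name is made up for illustration):

import io
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class CompressedImageDataset(Dataset):
    """Keeps every sample compressed (raw JPEG/PNG bytes) in RAM and
    decodes a sample only when it is requested."""

    def __init__(self, file_paths, labels, transform=None):
        # Read the already-compressed bytes from disk once, up front.
        self.buffers = [Path(p).read_bytes() for p in file_paths]
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.buffers)

    def __getitem__(self, idx):
        # Decompress only this one sample; the decoded image is freed
        # again once the minibatch has been processed.
        img = Image.open(io.BytesIO(self.buffers[idx])).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, self.labels[idx]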

Please let me know which option works better for you!


Thanks for your advice. For now, I have stored my data on an SSD, and the training speed is up. I think the tricks you introduced are all I need to speed things up. It's important to improve the performance of my model, so I will try the tricks above some time later and let you know which option works better for me.

When using 8 GPUs for training, num_workers=0 works, but with num_workers=4 or 8 it just stops there. Maybe a PyTorch bug?


So you're saying that with two 2080 Tis it took you about 30 seconds longer to train than with one 2080 Ti?
I don't think using more GPUs should ever slow down epoch training time if parallelization is working properly. AFAIK there's no correlation between training convergence and num_workers, since num_workers doesn't deal with backpropagation, only data loading I/O.

Yeah, I've since changed my tune about convergence. Agreed. That's what I got when the batch size was the same, so parallelisation was NOT working properly. The overheads of parallelisation don't pay off without increasing the batch size. Right?


I find that num_workers == batch_size_per_gpu works best for my model. For example, if you have 2 GPUs and your batch size is 6 (3 samples per GPU), then setting num_workers=3 may work.


My previous timings re num_workers were confounded by my hardware. I now realise that my bottleneck was the SATA SSD and the disk reads of individual HDF5 files for each image. So increasing num_workers past 4, on my machine, had no effect because of this rate-limiting hardware.


Hi, Stephen! It is a funny coincidence that my family name is Song, too.
I think in a common case the batch size is quite large, such as 32, and if you have 2 GPUs you may set the batch size to 32. However, if we consider the number of CPU threads, then 32 might be larger than the number of CPU threads, which does not seem reasonable.

Wow, that makes sense. Actually, I never tried num_workers larger than 8. I think choosing a proper number from a limited set of choices (e.g. [2, 3, 4, 5, 6, 7, 8]) may be enough to find the optimal setting.


OK, thank you for your reply.

Hi,
I really need to use multiple workers while loading my dataset.
Can anyone give me a suggestion for using “num_workers”, or a reference for writing a dataset with multiprocessing?
Here are some important parameters:
batch size: 10
GPU: NVIDIA 2080 Ti
num_workers values I have tried: 3, 4, 5, 6, 7 and 8; each gives me a killed signal (still trying num_workers=1 and 2).
Best regards,
albert christianto

Dear @ptrblck, is there any way to load the dataset directly onto the GPU with Dataset and DataLoader?

And I am not sure whether my dataset counts as big or small, so I'd love to try both ways of feeding data to my model.

This may help (adapted from a ZhiHu post):

import torch
from torchvision import datasets, transforms
import time


if __name__ == '__main__':
    use_cuda = torch.cuda.is_available()

    for num_workers in range(0, 50, 5):  # sweep over the number of workers
        kwargs = {'num_workers': num_workers, 'pin_memory': False} if use_cuda else {}
        train_loader = torch.utils.data.DataLoader(
            datasets.MNIST('./data', train=True, download=True,
                           transform=transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))
                           ])),
            batch_size=64, shuffle=True, **kwargs)



        start = time.time()
        for epoch in range(1, 5):
            for batch_idx, (data, target) in enumerate(train_loader):  # just load the batches
                pass
        end = time.time()
        print("Finish with:{} second, num_workers={}".format(end-start,num_workers))

Sure, it's possible, but you should consider a few shortcomings.
If you are dealing with a (preprocessed) array / tensor, you could simply load it, push it to the device, and index it to create batches. A DataLoader could still be used, but e.g. multiple workers most likely won't help much in speeding up your data pipeline, as the data is already on the GPU.

If you want to apply some data augmentation methods, you would need to apply them on the GPU. Since a lot of torchvision transformations are written using PIL, you would have to use another library or implement them manually.

Also note that your data will use memory on your device, which cannot be used by the model anymore.
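
A minimal sketch of the first case, assuming the preprocessed data already exists as saved tensors (the file names are placeholders): batches are created by indexing the on-GPU tensor instead of going through DataLoader workers.

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hypothetical preprocessed tensors saved earlier with torch.save.
data = torch.load('data.pt').to(device)        # e.g. [N, C, H, W]
targets = torch.load('targets.pt').to(device)  # e.g. [N]

batch_size = 64
perm = torch.randperm(data.size(0), device=device)  # reshuffle every epoch

for start in range(0, data.size(0), batch_size):
    idx = perm[start:start + batch_size]
    batch, labels = data[idx], targets[idx]
    # forward / backward pass here; no host-to-device copy is needed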
