Does data loading always get faster when increasing the number of DataLoader workers?

In my ImageNet task I use an LMDB file as the dataset; here is my dataset class:

    import io
    import os

    import lmdb
    import torch
    from PIL import Image
    from torch.utils.data import Dataset

    class ImageNetDataset(Dataset):

        def __init__(self, config, phase='train', transforms=None):
            data_root = config.DATASET.ROOT
            self.phase = phase
            self.lmdb_file = os.path.join(data_root, self.phase + '.lmdb')
            label_file = os.path.join(data_root, self.phase + '_label.txt')
            self.transforms = transforms
            with open(label_file) as f:
                self.lines = f.readlines()
            self.workers = 32
            # open the LMDB environment read-only and keep a read transaction
            env = lmdb.open(self.lmdb_file, readonly=True, lock=False)
            self.txn = env.begin(write=False)

        def __len__(self):
            return len(self.lines)

        def __getitem__(self, idx):
            line = self.lines[idx]
            image_name = line.split()[0]
            # look up the raw bytes for this key and decode them with PIL
            image = Image.open(io.BytesIO(self.txn.get(image_name.encode())))
            if image.mode != 'RGB':
                image = image.convert('RGB')
            gt = int(line.split()[-1])
            image = self.transforms(image)
            return image, torch.tensor(gt, dtype=torch.long)
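One pattern worth knowing with file-backed datasets like this (a common sketch, not something from the original post): opening the handle in `__init__` means it is created in the parent process before the DataLoader forks its workers. A lazy-initialization sketch, where a placeholder tuple stands in for the real `lmdb.open(...).begin(write=False)` call, looks like this:

```python
import os

class LazyHandle:
    """Sketch of lazy per-worker initialization: a heavy handle (e.g. an
    lmdb read transaction) is opened on first use in each process, after
    the DataLoader forks, instead of once in __init__ in the parent."""

    def __init__(self, path):
        self.path = path
        self._txn = None  # opened lazily, once per process

    @property
    def txn(self):
        if self._txn is None:
            # placeholder standing in for:
            #   env = lmdb.open(self.path, readonly=True, lock=False)
            #   self._txn = env.begin(write=False)
            self._txn = (os.getpid(), self.path)
        return self._txn
```

In a real `Dataset`, `__getitem__` would read through `self.txn`, so each worker process ends up with its own independently opened handle.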

I found that using num_workers=16 is faster than num_workers=32, and the CPU utilization is higher. I am not sure whether something is wrong with my dataset or whether some other reason causes this problem.
The situation gets worse when I use a distributed training setup (one subprocess per GPU, 4 GPUs in total, num_workers=16). I have many CPUs, but in this case their utilization is low. Is it that the worker subprocesses (running the dataset's __getitem__) spawned by each main_worker subprocess cannot take advantage of all the CPUs?


I have a similar question: I don't know if there is a good way to decide the value of DataLoader.num_workers.

An advisor of mine told me that this parameter might have some relation to the usage of RAM.

Do you mean that a larger number of workers will hurt speed once the program reaches the maximum amount of memory? In my program the problem appears while half of the memory is still unused.

By having more processes simultaneously doing random-access IO, there is a good chance you'll start overloading whatever IO device you're reading from; it's not a friendly read pattern and you'll likely end up with a lot of processes blocked on IO. There will be a number of workers beyond which there is no point in increasing for your system; start small and increase until it stops improving.

More worker processes will also increase the memory utilization.


Hi, Ross! Thanks for your reply; it inspires me a lot. I am wondering, is there any fast way to decide the number of worker processes?
Looking forward to your reply!

The end of this thread covers it pretty well, including some measurements of a specific scenario by @michaelklachko : How to prefetch data when processing with GPU?

TL;DR: my rule of thumb is to make the total number of workers, summed across all distributed training processes running on the machine, 0 to 2 less than the number of logical CPU cores my CPU has.
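That rule of thumb can be sketched as a small helper. `suggest_num_workers` and its `reserve` parameter are hypothetical names for this sketch, not a PyTorch API:

```python
import os

def suggest_num_workers(num_train_processes, reserve=2, cores=None):
    """Rule-of-thumb sketch: divide (logical cores - reserve) evenly
    across all distributed training processes on this machine."""
    if cores is None:
        cores = os.cpu_count() or 1  # logical cores on this machine
    per_process = (cores - reserve) // max(num_train_processes, 1)
    return max(per_process, 1)

# e.g. 4 distributed processes on a 64-thread machine:
# suggest_num_workers(4, cores=64) -> 15
```

This is only a starting point; as noted above, you should still sweep around the suggested value and measure.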

When debugging, ‘htop’ can help once your training process is running and in a steady state. If every single core is completely maxed out, and especially if there is a lot of red (kernel time), you might want to try backing off the worker count a bit to see if the throughput improves. Also, can be worth checking the output of ‘i7z’ to make sure your CPU is running at proper levels and not being throttled. Sometimes the power state governor in your Linux install can be overly conservative, keeping the frequencies down.

The best easy way to cut back on some CPU usage for typical dataset/augmentation setups for image problems is to replace Pillow with Pillow-SIMD (it’s a pain to maintain the package dependencies, but usually worth the pain). Pillow is not the most efficient imaging library. Beyond that, switching to a CV2 image pipeline, using DALI, or perhaps trying something like Kornia could give you back some CPU cycles.


@rwightman, @songyuc I did some experimentation with the number of workers, and I can say that the best way to find the optimal one is to run a test over a range of values, for maybe 100 batches. It is highly dependent on the particular combination of CPU, storage, dataloader type, preprocessing methods, and model type/size. I'm afraid there's no magic formula to rely on. For example, I found that on my 6-core CPU, with DataParallel, standard ImageNet preprocessing for ResNets, a vanilla ResNet-18 model, and a Samsung 850 Pro SSD, DALI-CPU works best, and the optimal number of workers is 12. But the range 8-16 is nearly as fast, which is counterintuitive given the number of cores/threads.

P.S. The difference between optimal and default params can be quite significant; for example, in the above scenario, using the default num_workers=4 leads to a ~20% slowdown.
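The sweep described above can be sketched generically. `time_loader`, `best_num_workers`, and the `make_loader` factory are hypothetical names for this sketch; `make_loader(w)` is something you would write to build a fresh DataLoader for your dataset with `num_workers=w`:

```python
import time
from itertools import islice

def time_loader(make_loader, num_workers, num_batches=100):
    """Time how long it takes to pull `num_batches` batches from a
    freshly built loader with the given worker count."""
    loader = make_loader(num_workers)
    start = time.perf_counter()
    for _ in islice(iter(loader), num_batches):
        pass  # in a real sweep you might also move the batch to GPU
    return time.perf_counter() - start

def best_num_workers(make_loader, candidates=(0, 4, 8, 12, 16)):
    """Return the candidate worker count with the lowest measured time."""
    return min(candidates, key=lambda w: time_loader(make_loader, w))
```

Running this once per candidate is noisy; averaging two or three sweeps per value gives a more trustworthy winner.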


Thank you for the experimentation and for sharing; it inspires me a lot.

And I am also wondering about the number of CPU threads, which I think might account for the optimal number 12, since the number of CPU threads is usually about twice the number of cores.
Looking forward to your reply.