Dataloader with `num_workers > 0` only using CPU in main process?

After a lot of debugging, I’ve found that my worker processes end up using no CPU; only the main process seems to be using CPU to preprocess the batch data. As far as I understand, each worker process should call the dataset’s `__getitem__` independently (in which I just load NumPy files and apply transforms to them). Perhaps all the workers are relying on the main process to perform the transforms for some reason? Any suggestions as to why this might be?
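For anyone who wants to check the same thing, here is roughly how I verified which process each `__getitem__` call runs in (the dataset below is just a stand-in for my real one, which loads NumPy files):

```python
import os

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info


class StandInDataset(Dataset):
    """Stand-in dataset: pretends to load a NumPy file and reports who loaded it."""

    def __len__(self):
        return 400

    def __getitem__(self, index):
        info = get_worker_info()
        worker = info.id if info is not None else "main process"
        print(f"index {index} handled by pid {os.getpid()} (worker: {worker})")
        example = np.random.rand(3, 64, 64).astype(np.float32)  # stands in for np.load(...)
        return torch.from_numpy(example)


if __name__ == "__main__":
    loader = DataLoader(StandInDataset(), batch_size=100, num_workers=4)
    for batch in loader:
        pass
```

With a healthy setup the prints should be spread across several different worker ids; in my case they were not.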

Original post:

I’m having trouble finding the training-time bottleneck in my code. For some reason, setting `num_workers=0` results in the fastest training. I ran a series of experiments comparing training speed under various conditions, and I can’t figure out why `num_workers=0` would be fastest.

During my experiments I’ve been monitoring CPU and memory usage (using glances), and neither comes anywhere near being maxed out. The exception is `num_workers=0`, where the CPU core the main process runs on is almost always at 100% usage; yet, for some reason, this is when training runs the fastest. GPU utilization only spikes occasionally (when a batch is being run); most of the time the GPU seems to be idling. Turning `pin_memory` on or off for the DataLoaders makes no speed difference. Reading data from the SSD does not seem to be the limiting factor either: I made the dataset simply remember the examples it had already loaded instead of reading another file, to test whether disk read speed was the limit, but even with this setup `num_workers=0` was still fastest. The only factor I can think of that I’m not directly monitoring is the transfer speed from host memory to GPU memory, but I wouldn’t expect that to be the limiting factor, and if it were, I would expect `pin_memory` to make a difference.
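The disk-read test was along these lines (a simplified sketch, not my exact code; `transform` stands in for my real preprocessing):

```python
import numpy as np
from torch.utils.data import Dataset


class CachingDataset(Dataset):
    """Simplified sketch of the disk-read test: remember every example after its
    first load so that repeated passes never touch the SSD again."""

    def __init__(self, file_paths, transform):
        self.file_paths = file_paths
        self.transform = transform
        self.cache = {}

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, index):
        if index not in self.cache:
            self.cache[index] = np.load(self.file_paths[index])  # only hits disk once
        return self.transform(self.cache[index])
```

(When workers are used, each worker process keeps its own copy of the cache, but this was enough to rule out disk reads as the bottleneck.)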

Just in case more details are helpful, my speed experiments consisted of the following. Each batch contains 100 images. The images and their corresponding labels have significant transforms applied to them in the DataLoader: a patch of each image is extracted, and the remainder of the transforms depends on what’s in that patch, so they can only be performed after the patch is extracted. Saving every permutation of this preprocessed data would require too much storage, which is why the transforms are applied during training. Timing 10 batches of processing takes ~5.5 s with `num_workers=0`, ~7 s with `num_workers=1`, and ~10 s with `num_workers=4` (and again, toggling `pin_memory` had no impact on speed in any of these trials).
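The timing itself was nothing fancy; roughly this kind of harness (the `dataset` argument stands in for my real dataset):

```python
import time
from itertools import islice

from torch.utils.data import DataLoader


def time_batches(dataset, num_workers, pin_memory, batch_size=100, num_batches=10):
    """Rough harness: how long do the first `num_batches` batches take to produce?"""
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, pin_memory=pin_memory)
    start = time.time()
    for _ in islice(loader, num_batches):
        pass  # in the real runs the batch also goes to the GPU and through the network
    return time.time() - start


# for workers in (0, 1, 4):
#     for pin in (False, True):
#         print(workers, pin, time_batches(my_dataset, workers, pin))  # my_dataset: placeholder
```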

Given that with `num_workers=0` I can clearly see the CPU core the main process runs on sitting at nearly 100%, it seems strange that adding more workers would slow things down, especially when running multiple workers shows no obvious bottleneck anywhere else. Can anyone suggest what I might be doing wrong? Thank you for your time.

Original post update (GAN):
After writing the above, I thought of one additional factor that may be worth noting. I’m working with a GAN, where both labeled and unlabeled data are passed to the network. Because of this, I have two DataLoaders passing batches to the network. Does having two separate DataLoaders feeding the same network cause trouble (especially with regard to pinning memory)? And if so, is there a way to avoid the problem?
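For context, the two-loader setup is essentially the following (simplified; the tensor datasets are just stand-ins for my real labeled and unlabeled datasets):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for my real datasets; shapes are made up for the sketch.
labeled_dataset = TensorDataset(torch.randn(500, 3, 64, 64), torch.randint(0, 10, (500,)))
unlabeled_dataset = TensorDataset(torch.randn(500, 3, 64, 64))

if __name__ == "__main__":
    labeled_loader = DataLoader(labeled_dataset, batch_size=100, shuffle=True,
                                num_workers=4, pin_memory=True)
    unlabeled_loader = DataLoader(unlabeled_dataset, batch_size=100, shuffle=True,
                                  num_workers=4, pin_memory=True)

    for (labeled_images, labels), (unlabeled_images,) in zip(labeled_loader, unlabeled_loader):
        # Both batches are passed to the same network within one training step.
        pass
```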

Original post update 2 (Workers don’t seem to be using CPU):
Strangely, it doesn’t seem like any of my worker processes end up using any CPU. As far as I understand, each worker process should call the dataset’s `__getitem__` independently (in which I just load NumPy files and apply transforms to them). Perhaps all the workers are relying on the main process to perform the transforms for some reason? Any suggestions as to why this might be?

I solved the problem. The cause was that my dataset reported a length equal to the batch size. The DataLoader hands out the remaining dataset indexes to its workers, so when the dataset length equals the batch size, only one worker is given indexes to preprocess.

I had done this because only a small patch of each full image is used during processing, and this patch is randomly chosen from each image during the transform part of the preprocessing. Since only a tiny part of each image is used (and the total number of full-size images is less than the desired batch size), it didn’t make sense to report the actual number of images as the dataset’s length. In the end, to utilize every worker of the DataLoader, I needed to set the dataset length to `num_workers * batch_size`. This way, the DataLoader can give each worker `batch_size` indexes to work on.
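Roughly, the fixed dataset looks like this (simplified; `file_paths` and the patch logic stand in for my real data and transforms):

```python
import random

import numpy as np
import torch
from torch.utils.data import Dataset


class RandomPatchDataset(Dataset):
    """Sketch of the fix: report a 'virtual' length of num_workers * batch_size so
    every worker is handed indexes, and sample a random image and patch per call."""

    def __init__(self, file_paths, patch_size=64, num_workers=4, batch_size=100):
        self.file_paths = file_paths
        self.patch_size = patch_size
        self.virtual_length = num_workers * batch_size

    def __len__(self):
        return self.virtual_length  # not the number of images on disk

    def __getitem__(self, index):
        image = np.load(random.choice(self.file_paths))  # the index itself is ignored
        top = random.randint(0, image.shape[0] - self.patch_size)
        left = random.randint(0, image.shape[1] - self.patch_size)
        patch = image[top:top + self.patch_size, left:left + self.patch_size]
        # ...the remaining transforms, which depend on what is in the patch...
        return torch.from_numpy(patch.astype(np.float32))
```

With this, `DataLoader(dataset, batch_size=100, num_workers=4)` gives each worker a full batch’s worth of indexes, and all of the workers actually get work to do.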