After a lot of debugging, I’ve found that my worker processes end up using no CPU; only the main process seems to be using CPU to preprocess the batch data. As far as I understand, each worker process should call the Dataset’s __getitem__
method independently (during which I just load NumPy files and perform transformations on them). Perhaps all the workers are relying on the main process to perform the transforms for some reason? Any suggestions as to why this might be?
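For reference, here’s a minimal sketch of the kind of setup I’m describing (stand-in file paths, shapes, and data, not my actual pipeline): a Dataset whose __getitem__ loads a NumPy file and transforms it, consumed by a DataLoader with multiple workers, where each worker subprocess should be running __getitem__ on its own.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NpyDataset(Dataset):
    """Loads one NumPy file per example and applies an optional transform."""
    def __init__(self, paths, transform=None):
        self.paths = paths
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # With num_workers > 0, each worker subprocess runs this method
        # independently for the indices assigned to it.
        example = np.load(self.paths[idx]).astype(np.float32)
        if self.transform is not None:
            example = self.transform(example)
        return torch.from_numpy(example)

# Hypothetical stand-in data: a few small arrays written to disk as .npy files.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(tmpdir, f"example_{i}.npy")
    np.save(p, np.random.rand(8, 8))
    paths.append(p)

loader = DataLoader(NpyDataset(paths), batch_size=2, num_workers=2)
batch = next(iter(loader))
```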
Original post:
I’m having trouble finding the bottleneck in training time for my code. For some reason, it appears setting num_workers=0
results in the fastest training. I ran a series of experiments comparing training speeds under various conditions, and I can’t seem to figure out why num_workers=0
would produce the fastest speed.
During my experiments, I’ve been monitoring the CPU and memory (using glances), and neither seems to come anywhere near maximal usage. The exception is when I’m using num_workers=0: then the CPU core the main process is running on is almost always at 100% usage. Again, though, for some reason this is when training runs fastest. GPU usage only spikes occasionally (when a batch is being run); most of the time the GPU just seems to be idling. Turning pin_memory
off and on for the DataLoader
s results in no speed difference. Reading data from the SSD does not seem to be the limiting factor (I tried setting the DataLoader to just remember the examples it already loaded and not load another file to test if disk read speed was limiting, but even with this setup, num_workers=0
was still fastest). The only factor I can think of that I’m not directly monitoring is the transfer speed from memory to GPU memory, but I wouldn’t expect this to be the limiting factor, and I would expect pin_memory
to result in a change here.
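For completeness, my understanding is that pin_memory only pays off when the host-to-device copy itself is issued with non_blocking=True, so the copy can overlap with compute. A minimal sketch of that pattern (stand-in tensors, not my real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data; batch size matches the 100-image batches above.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

# Pinning only makes sense when a GPU is present, so gate it on availability.
use_cuda = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=100, pin_memory=use_cuda)
device = torch.device("cuda" if use_cuda else "cpu")

for images, labels in loader:
    # non_blocking=True lets the host-to-device copy overlap with compute,
    # but only when the source batch lives in pinned memory; without it,
    # pin_memory=True buys essentially nothing.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    break
```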
Just in case some more details are helpful, my speed experiments consisted of the following. Each batch consists of 100 images. The images and their corresponding labels have significant transforms applied to them in the DataLoader
. A patch of the image is extracted and the remainder of the transforms depends on what’s in that patch, so these transforms can only be performed after the patch is extracted. Trying to save every permutation of this preprocessed data would require too much storage, which is why the transforms are applied during training. Timing 10 batches of processing with num_workers=0
takes ~5.5s. With num_workers=1
it takes ~7s and num_workers=4
takes ~10s (again noting that toggling pin_memory
on or off in each of these trials seemed to have no impact on speed).
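For reference, a timing loop along these lines can reproduce this kind of measurement (shown with a hypothetical lightweight stand-in dataset rather than my real one; the first batch is discarded so worker startup cost doesn’t skew the numbers):

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in; substitute the real Dataset to reproduce the
# ~5.5s / ~7s / ~10s numbers quoted above.
dataset = TensorDataset(torch.randn(2000, 16))

def time_ten_batches(num_workers):
    loader = DataLoader(dataset, batch_size=100, num_workers=num_workers)
    it = iter(loader)  # with num_workers > 0, this spawns the worker processes
    next(it)           # discard the first batch so worker startup cost
                       # doesn't pollute the measurement
    start = time.perf_counter()
    for _ in range(9):
        next(it)
    return time.perf_counter() - start

timings = {n: time_ten_batches(n) for n in (0, 1, 4)}
```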
Given that with num_workers=0
, I can clearly see the individual CPU core the main process is running on is nearly always at 100% usage, it seems strange that adding more workers would decrease the speed. It also seems strange that running multiple workers is slower despite there being no obvious bottleneck. Can anyone suggest what I might be doing wrong? Thank you for your time.
Original post update (GAN):
After writing the above, I considered one additional factor that may be worth noting. I’m working with a GAN, where both labeled and unlabeled data is being passed to the network. Because of this, I have two DataLoader
s passing batches to the network. Does having two separate DataLoader
s trying to access the same input to the network cause trouble (especially in regards to pinning memory)? And if so, is there a way to avoid this problem?
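To illustrate the setup, here’s a minimal sketch (stand-in tensors, not my real data) of how the two loaders are driven together each step:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical labeled and unlabeled stand-ins for the GAN's two inputs.
labeled = TensorDataset(torch.randn(400, 16), torch.randint(0, 10, (400,)))
unlabeled = TensorDataset(torch.randn(400, 16))

labeled_loader = DataLoader(labeled, batch_size=100)
unlabeled_loader = DataLoader(unlabeled, batch_size=100)

# Each DataLoader keeps its own workers and (optional) pinning thread, so the
# two don't share state; zip simply draws one batch from each per step.
steps = 0
for (x_labeled, y), (x_unlabeled,) in zip(labeled_loader, unlabeled_loader):
    steps += 1
```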
Original post update 2 (Workers don’t seem to be using CPU):
Strangely, it doesn’t seem that any of my worker processes end up using any CPU. As far as I understand, each worker process should call the Dataset’s __getitem__
method independently (during which I just load NumPy files and perform transformations on them). Perhaps all the workers are relying on the main process to perform the transforms for some reason? Any suggestions as to why this might be?