Processing the augmentation on the CPU in parallel with the GPU

I’m trying to get the most out of my machine. I’ve tried many things already but I believe there’s still more to squeeze out of it. It’s just that no matter what I do, I cannot get the last drops.

This is my hardware’s current utilization state:

I labelled the charts for the CPU and the GPU. The CPU chart is on top and the GPU chart is at the bottom, and both grow from the middle outward (the program is btop). My question is: how can I move the two peaks within the blue rectangle to the left (the charts scroll from right to left)? The peaks are the CPU processing the augmentation (of images). As far as I can tell, that work could happen in parallel with the GPU processing the current batch. But for some reason, no matter what I do, it does not start until the GPU is done with the previous batch. BTW, there are two pairs of peaks because one is for the training dataset and the other is for the validation dataset.

These are what I’ve done so far:

  1. AMP (Automatic Mixed Precision): I know it has nothing to do with what I’m asking but I thought I should mention it.
  2. num_workers=8: My CPU has 8 cores and 16 threads. Through experimentation, I realized that more than 8 workers does not help.
  3. prefetch_factor=2: Technically, I didn’t set this parameter and 2 is the default value. I did experiment with higher numbers but it didn’t help either so I changed it back.
  4. pin_memory=False: I did turn this feature on at one point, and it did help, but only by about 10%. The shape of the charts was more or less the same as shown here. Perhaps the only difference was that the saw-tooth effect on the peaks was smoothed out. But I had to turn it off since I was randomly running out of vRAM.
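
For reference, here is a minimal sketch of how the loader settings above map onto `DataLoader` keyword arguments (AMP lives in the training loop, not on the loader, so it is left out). The names `loader_kwargs`, `batches_in_flight`, and `train_dataset` are hypothetical. Note that `prefetch_factor` applies per worker, so the number of batches the workers try to keep ready is `num_workers * prefetch_factor`:

```python
# Loader settings from the list above, as keyword arguments for
# torch.utils.data.DataLoader. `loader_kwargs` is a hypothetical name.
loader_kwargs = {
    "num_workers": 8,       # one worker process per physical core
    "prefetch_factor": 2,   # default value; batches prefetched *per worker*
    "pin_memory": False,    # turned off because of the vRAM issue
}


def batches_in_flight(num_workers: int, prefetch_factor: int) -> int:
    """Upper bound on the batches the workers keep prefetched in total."""
    return num_workers * prefetch_factor


print(batches_in_flight(loader_kwargs["num_workers"],
                        loader_kwargs["prefetch_factor"]))  # 16

# Assumed usage (requires PyTorch; `train_dataset` is a placeholder):
# from torch.utils.data import DataLoader
# train_loader = DataLoader(train_dataset, **loader_kwargs)
```

So with these values, the workers should be able to queue up to 16 batches ahead of the training loop.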

And I’m out of ideas. I don’t understand why the next batch cannot be processed in parallel while the GPU is working on the previous one. Any ideas?

I don’t know what the CPU utilization peaks represent, or whether they correspond to loading and processing a single batch or the entire training and validation datasets.
In the former case, your CPU might just be too weak, which should also show up as low GPU utilization. You can profile the data loading pipeline in isolation by just iterating the DataLoader, without the model, to check how fast each batch can be loaded and processed using multiple workers.
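
A minimal sketch of such an isolation test, assuming nothing about the actual pipeline: the helper `time_batches` (a hypothetical name) accepts any iterable of batches, so a `DataLoader` can be passed in directly, and it records how long the loop waits for each batch with no GPU work in between.

```python
import time
from typing import Iterable, List


def time_batches(loader: Iterable, max_batches: int = 50) -> List[float]:
    """Return the seconds spent waiting for each of the first `max_batches` batches."""
    waits: List[float] = []
    start = time.perf_counter()
    for i, _batch in enumerate(loader):
        # Time elapsed since the previous batch was received.
        waits.append(time.perf_counter() - start)
        if i + 1 >= max_batches:
            break
        start = time.perf_counter()
    return waits


if __name__ == "__main__":
    # Stand-in for a real loader: each "batch" takes about 10 ms to produce.
    def fake_loader(n: int = 5, delay: float = 0.01):
        for _ in range(n):
            time.sleep(delay)
            yield object()

    waits = time_batches(fake_loader())
    print(f"batches: {len(waits)}, mean wait: {sum(waits) / len(waits):.4f}s")
```

With a real loader, the first wait typically includes worker start-up, so it is worth looking at it separately from the rest. If the later waits are longer than one GPU step, the data pipeline is the bottleneck.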

Dear @ptrblck,

Thank you for your reply. To provide more info, the dataset object holds a list of image file paths, and when the dataloader asks for an image, the dataset object reads the image into memory (from the hard drive), augments it, and finally returns it. There are some tabular features too, but those are already in memory and do not take time to prepare.
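
The dataset described above could be sketched roughly as follows. This is not the actual code: the class name `ImageTabularDataset` and the injected `load_image` and `augment` callables are hypothetical placeholders, and in practice the class would subclass `torch.utils.data.Dataset`.

```python
from typing import Any, Callable, Sequence, Tuple


class ImageTabularDataset:
    """Map-style dataset: image paths on disk, tabular features in memory."""

    def __init__(
        self,
        image_paths: Sequence[str],
        tabular_rows: Sequence[Any],
        load_image: Callable[[str], Any],  # hypothetical: reads one image from disk
        augment: Callable[[Any], Any],     # hypothetical: CPU augmentation
    ) -> None:
        if len(image_paths) != len(tabular_rows):
            raise ValueError("image_paths and tabular_rows must have equal length")
        self.image_paths = image_paths
        self.tabular_rows = tabular_rows
        self.load_image = load_image
        self.augment = augment

    def __len__(self) -> int:
        return len(self.image_paths)

    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        # Both steps run inside a DataLoader worker process when num_workers > 0.
        image = self.load_image(self.image_paths[index])  # disk read
        image = self.augment(image)                       # CPU-heavy step
        return image, self.tabular_rows[index]            # features already in memory
```

Because the read and the augmentation both happen in `__getitem__`, all of the per-sample CPU cost is paid in the worker processes.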

I already have the profiling for each step. But regardless of what the profiling tells us (not that it contradicts what I’m going to say next), the charts I’ve shown above tell me that the CPU is not busy processing the next batch. As you can see, the left side of the blue box is empty, which means the CPU is idle. But for some reason, the CPU does not start processing the next batch until the GPU is done with the previous one.

I know I’m not providing enough info. Please let me know how I can provide more.