DataLoader parameters have little or no impact on training speed

Hey there, I’m training a DETR model on around 0.9M images, each paired with an XML file that contains multiple annotations for that image. I upgraded my GPU from a GTX 980 Ti to an RTX 4090 and only roughly doubled training speed, when I expected a much larger gain. After asking around, many have agreed that I’m likely bottlenecked by the data-loading part of my training script.

I’ve set the data up as a torch.utils.data.Dataset, which is passed to a torch.utils.data.DataLoader (a minimal sketch of the setup is shown after the list below). In the hope of removing the bottleneck I’ve now tested multiple combinations of DataLoader settings, without success:

  • pin_memory set to True and False; it seems to make no difference whatsoever.
  • num_workers in a range from 0 to 12 (the number of logical cores on my Intel i7-8700K). So far the best results are with 1.
  • prefetch_factor set to None, 2, 8, 64, 256, and 1024. 2 seems to be the quickest, although at that setting I have plenty of GPU and CPU RAM to spare and training is still slow.
  • persistent_workers set to True and False. Seems to make no difference.
  • batch_size I’ve kept at 2 for these tests, because that is what the authors used. Each epoch is around 400K steps at this setting.
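For reference, this is roughly how the loader is set up. It’s a minimal sketch with a placeholder dataset and collate function rather than my actual code; the comments note the values I’ve tried:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PlaceholderDataset(Dataset):
    """Stands in for the real detection dataset (image + parsed XML targets)."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        image = torch.rand(3, 800, 800)  # fake image tensor
        target = {"boxes": torch.rand(4, 4),
                  "labels": torch.zeros(4, dtype=torch.int64)}
        return image, target

def collate_fn(batch):
    # DETR-style batching: targets stay a list because each image has a
    # different number of annotations.
    images, targets = zip(*batch)
    return list(images), list(targets)

loader = DataLoader(
    PlaceholderDataset(),
    batch_size=2,             # kept at 2 because that's what the authors used
    shuffle=True,
    num_workers=1,            # tried 0-12; 1 has been fastest so far
    pin_memory=True,          # tried True and False, no visible difference
    prefetch_factor=2,        # tried None, 2, 8, 64, 256, 1024; 2 is quickest
    persistent_workers=True,  # tried True and False, no visible difference
    collate_fn=collate_fn,
)
```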

About the hardware and what happens during training:

  • All the data is stored on an NVMe SSD, which is also where everything is installed (>1 GB/s reads and writes). During training I don’t see much load on the SSD.
  • RAM usage hovers anywhere between 10 and 64 GB (out of 64 GB), depending on the setup. The only configuration that capped RAM was num_workers=12 combined with prefetch_factor=256: RAM fills to around 64 GB and then quickly starts to empty again, even before ~1000 steps have passed. I don’t understand why data is loaded into RAM and then removed from it.
  • CPU usage is usually at 30-50%. It only climbs higher in configurations where a lot of data is being loaded into RAM, stays elevated for that period, and then quickly drops back down to 30-50%. Changing num_workers doesn’t impact CPU usage.
  • My RTX 4090 hovers around 50-100% utilization in spikes, nowhere near sustained full load. In terms of GPU memory, at this batch_size I only use between 5 and 9 GB. I can go up to batch_size=8, and then GPU memory usage is over 20 GB of the 24 GB.

I would expect the parameters above to improve training speed, but nothing seems to help.

Here is a link to what __getitem__ does

Here is an example of the transforms that are applied to the images on every pass.
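Since the links don’t show the code inline, here is roughly the shape of the per-item work, heavily simplified. The class and helper below are illustrative stand-ins, not my actual implementation; the real version also applies the transform chain from the link above:

```python
import xml.etree.ElementTree as ET

import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class VOCStyleDetectionDataset(Dataset):
    """Illustrative sketch: one image plus one Pascal VOC XML per sample."""
    def __init__(self, image_paths, xml_paths, transforms=None):
        self.image_paths = image_paths
        self.xml_paths = xml_paths
        self.transforms = transforms

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Both the image decode and the XML parse happen here, i.e. once per
        # sample per epoch, inside whichever worker process picked up the index.
        image = read_image(self.image_paths[idx]).float() / 255.0
        boxes, labels = self._parse_voc_xml(self.xml_paths[idx])
        target = {"boxes": boxes, "labels": labels}
        if self.transforms is not None:  # crop/resize/normalize chain
            image, target = self.transforms(image, target)
        return image, target

    @staticmethod
    def _parse_voc_xml(xml_path):
        root = ET.parse(xml_path).getroot()
        boxes, names = [], []
        for obj in root.iter("object"):
            bb = obj.find("bndbox")
            boxes.append([float(bb.find(k).text)
                          for k in ("xmin", "ymin", "xmax", "ymax")])
            names.append(obj.find("name").text)
        return torch.tensor(boxes), names
```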

Did you profile the code to verify it, or is this just a guess?

It was just a guess, but I did profile the code afterwards. To be frank, I can’t tell much from the results, but maybe someone else is better at interpreting them (below).

For example, when I look for TightAnnotationCrop, which is applied to every single image, the profiler says it’s called only once. read_pascal_voc is used to load the annotations from the XML files, and it is called 104K times (once for each batch), taking a total of 12 seconds over the ~11-hour run. What I don’t understand is that it reports a cumtime of 63, which would mean the script is done loading XML files after the first minute of operation, before training even starts.
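In case it helps with interpreting the numbers, this is the kind of standalone timing loop I can run to isolate the DataLoader from the model. It’s a sketch, where `loader` stands for the DataLoader described above:

```python
import time

def time_dataloader(loader, max_batches=500):
    # Iterate the DataLoader without running the model, so the measured time
    # is pure loading / augmentation / collation cost.
    it = iter(loader)

    t0 = time.perf_counter()
    next(it)  # first batch includes worker start-up and the initial prefetch
    first = time.perf_counter() - t0

    n = 0
    t0 = time.perf_counter()
    for _ in range(max_batches - 1):
        try:
            next(it)
        except StopIteration:
            break
        n += 1
    per_batch = (time.perf_counter() - t0) / max(n, 1)
    print(f"first batch: {first:.3f} s, steady state: {per_batch:.4f} s/batch")

time_dataloader(loader)
```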

Another thing I noticed after posting is that there are indeed 12 Python processes running after setting num_workers to 12. Each process consumes around 0.2 GB of RAM and uses ~1-5% CPU while reading something from the disk at around 0.5-5 MB/s.

Made an album of images with more captions: https://imgur.com/a/KiNm3C9