Hey there, I’m training a DETR model on around 0.9M images, each paired with an XML file that contains multiple annotations for that image. I upgraded my GPU from a GTX 980 Ti to an RTX 4090 and only roughly doubled training throughput, when I expected much more. After asking around, many have agreed that I’m likely bottlenecked by the data-loading part of my training script.
I’ve initialized the data as a `torch.utils.data.Dataset`, which is passed to a `torch.utils.data.DataLoader`. In hopes of removing the bottleneck I’ve now tested multiple combinations of `DataLoader` settings, without success (a rough sketch of the setup follows the list below):
- `pin_memory` set to True and False; it seems to make no difference whatsoever.
- `num_workers` in a range from 0 to 12 (the number of CPU cores in my Intel i7-8700K). So far the best results are with 1.
- `prefetch_factor` set to [None, 2, 8, 64, 256, 1024]. 2 seems to be the quickest, although at that setting I still have a lot of GPU and CPU RAM available and training is slow.
- `persistent_workers` set to True and False. Seems to make no difference.
- `batch_size` kept at 2 for these tests, because that is what the authors used. Each epoch is around 400K steps at this setting.
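For reference, this is roughly how the loader is constructed (a minimal, self-contained sketch; the dataset class and `collate_fn` are simplified stand-ins for my actual code):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDetectionDataset(Dataset):
    """Stand-in for my real dataset, which loads an image and parses its paired XML file."""
    def __len__(self):
        return 900_000

    def __getitem__(self, idx):
        img = torch.rand(3, 800, 800)  # placeholder for the decoded, transformed image
        target = {"boxes": torch.rand(5, 4), "labels": torch.zeros(5, dtype=torch.int64)}
        return img, target

def collate_fn(batch):
    # DETR-style batching: keep images and targets as tuples instead of stacking.
    return tuple(zip(*batch))

loader = DataLoader(
    MyDetectionDataset(),
    batch_size=2,             # same as the authors
    shuffle=True,
    num_workers=1,            # tried 0-12; 1 has been fastest so far
    pin_memory=True,          # tried True and False, no difference
    prefetch_factor=2,        # tried None, 2, 8, 64, 256, 1024
    persistent_workers=True,  # tried True and False, no difference
    collate_fn=collate_fn,
)
```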
About the hardware and what happens during training:
- All the data is stored on an NVMe SSD (>1 GB/s reads and writes), which is also where everything is installed. During training I don’t see much stress on the SSD.
- RAM usage hovers anywhere between 10 and 64 GB (out of 64 GB) depending on the setup. The only configuration that capped RAM was `prefetch_factor=256`. In that case RAM fills up to around 64 GB and then quickly starts to drain again, even before ~1000 steps have passed. I don’t understand why data is loaded into RAM and then slowly removed from it.
- CPU usage is usually at 30-50%. It only spikes in configurations where a lot of data is being loaded into RAM, stays high for that period, and then quickly drops back down to 30-50%. Changing `num_workers` doesn’t impact CPU usage.
- My RTX 4090 hovers around 50-100% in spikes, so it’s not even close to constant full utilization. In terms of GPU RAM, at this `batch_size` I only consume between 5 and 9 GB. I can go up to `batch_size=8`, and then GPU RAM usage is at 20+ GB out of 24 GB.
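To double-check where the time goes, this is the kind of rough timing loop I use to see how long fetching a batch takes relative to the rest of the step (a sketch only; the real forward/backward pass is replaced here by a dummy GPU op):

```python
import time
import torch

data_time = 0.0   # time spent waiting on the DataLoader
step_time = 0.0   # time spent in the (dummy) training step

t0 = time.perf_counter()
for i, (images, targets) in enumerate(loader):
    t1 = time.perf_counter()
    data_time += t1 - t0  # waiting for this batch to arrive

    # Stand-in for the real forward/backward pass.
    img = images[0].cuda(non_blocking=True) if torch.cuda.is_available() else images[0]
    _ = img.sum()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make the GPU work visible to the wall-clock timer

    t0 = time.perf_counter()
    step_time += t0 - t1

    if i >= 200:  # a few hundred steps is enough for a stable ratio
        break

print(f"data: {data_time:.1f}s  step: {step_time:.1f}s  "
      f"data fraction: {data_time / (data_time + step_time):.0%}")
```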
I would expect tuning the above parameters to improve training speed, but nothing seems to be helping.
Here is a link to what `__getitem__` does.
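In case the link isn’t visible here, a trimmed-down approximation of what it does: open the image, parse the paired XML file into boxes and labels, apply the transforms, and return an (image, target) pair. The XML field names below are Pascal-VOC-style placeholders and may not match my files exactly:

```python
import xml.etree.ElementTree as ET
from PIL import Image
import torch

def getitem_sketch(img_path, xml_path, transforms=None, class_to_idx=None):
    # Load the image and parse its paired XML annotation file.
    img = Image.open(img_path).convert("RGB")

    boxes, labels = [], []
    root = ET.parse(xml_path).getroot()
    for obj in root.iter("object"):  # VOC-style element names (placeholder)
        bb = obj.find("bndbox")
        boxes.append([float(bb.find(t).text) for t in ("xmin", "ymin", "xmax", "ymax")])
        name = obj.find("name").text
        labels.append(class_to_idx[name] if class_to_idx else 0)

    target = {
        "boxes": torch.tensor(boxes, dtype=torch.float32),
        "labels": torch.tensor(labels, dtype=torch.int64),
    }
    if transforms is not None:
        img, target = transforms(img, target)  # joint image/box transforms
    return img, target
```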
Here is an example of the transforms that are applied to the images on each pass.
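The gist of it, approximated with plain torchvision transforms (image side only; my real pipeline also transforms the boxes, which plain torchvision doesn’t handle, and the scales below just follow DETR’s usual shorter-side convention rather than my exact settings):

```python
import torchvision.transforms as T

scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    # Randomly pick a shorter-side size, capping the longer side at 1333 px.
    T.RandomChoice([T.Resize(s, max_size=1333) for s in scales]),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```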