I have 512 GB of data that fits in pinned memory. Pinned memory works great when I use a single GPU: host-to-device transfers are fast enough to keep the GPU 100% busy (which is not the case when the data is not pinned). However, when I launch 4 training processes with torchrun to make full use of the 4 GPUs on one node, I hit a problem: I would need 4 copies of the 512 GB data in pinned memory, which I don't have room for.
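For reference, the single-GPU setup looks roughly like this (a minimal sketch; the shapes are made up, the real tensors are what add up to 512 GB):

```python
import torch

# Made-up shapes standing in for the real ~512 GB dataset.
N, D = 1_000_000, 128
data = torch.randn(N, D)

# Preload the whole dataset into page-locked (pinned) host memory once.
data = data.pin_memory()

device = torch.device("cuda:0")
batch_size = 4096
for i in range(0, N, batch_size):
    # Views of a pinned tensor share its pinned storage, so this copy
    # can run asynchronously and keep the GPU fully busy.
    batch = data[i : i + batch_size].to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...
```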
My current workaround is to turn off shuffling in the DataLoader, so that each torchrun-launched process's dataloader loads only its own shard of the data on demand (still 512 GB in total across all ranks). See the sketch below.
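Concretely, something like this (a sketch; the `TensorDataset` is a stand-in for my actual map-style dataset):

```python
import os
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

rank = int(os.environ["RANK"])              # set by torchrun
world_size = int(os.environ["WORLD_SIZE"])  # set by torchrun

# Stand-in for the real dataset.
dataset = TensorDataset(torch.randn(1_000_000, 128))

# Give each rank a contiguous 1/world_size shard and keep shuffle off,
# so each process only ever reads its own ~128 GB slice on demand.
n = len(dataset)
shard = Subset(dataset, range(rank * n // world_size,
                              (rank + 1) * n // world_size))

loader = DataLoader(shard, batch_size=256, shuffle=False,
                    num_workers=4, pin_memory=True)
```

This keeps total host memory at 512 GB, but I lose shuffling across shards.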
The question is: is there a way to preload the dataset into pinned memory once and share that pinned memory among all of the torchrun-launched processes?
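To clarify what I mean, this is the kind of thing I'm imagining; all names and the file path are made up, and I don't know whether registering a shared mapping like this is actually supported:

```python
import torch

# Made-up shapes/path; assume /dev/shm/dataset.bin already holds the
# preprocessed dataset as flat float32 values.
N, D = 1_000_000, 128

# Every torchrun process maps the SAME file-backed buffer, so there is
# only one physical 512 GB copy in host memory...
flat = torch.from_file("/dev/shm/dataset.bin", shared=True,
                       size=N * D, dtype=torch.float32)
data = flat.view(N, D)

# ...and then page-locks its own mapping so async H2D copies can use it.
# (0 = cudaHostRegisterDefault; no idea if this is reliable for shared maps.)
torch.cuda.cudart().cudaHostRegister(
    data.data_ptr(), data.numel() * data.element_size(), 0)

# If that worked, batches could be copied asynchronously as before:
batch = data[:4096].to("cuda", non_blocking=True)
```

The point is that the expensive part (the single pinned copy) would be set up once and be visible to all 4 ranks.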
Thank you!