Shared Pinned Memory

I have 512 GB of data that can be loaded into pinned memory. Pinned memory works great when I use only 1 GPU: host-to-device transfers are fast enough to keep the GPU 100% busy (which is not the case when the data are not pinned). However, when I launch 4 training processes with torchrun to fully use the 4 GPUs on my node, there is a problem: I would need 4 copies of the 512 GB data in pinned memory, which I don't have.
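For context, a minimal sketch of the single-GPU pattern I mean, with a placeholder dataset and batch size standing in for my real setup:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical in-memory dataset; stands in for the 512 GB of samples.
data = torch.randn(10_000, 1024)
labels = torch.randint(0, 10, (10_000,))
dataset = TensorDataset(data, labels)

# pin_memory=True makes the collated batches page-locked, so the
# .to(..., non_blocking=True) copies below can overlap with compute.
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=4, pin_memory=True)

device = torch.device("cuda:0")
for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward/backward step here ...
```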

Currently, my workaround is to turn off shuffling in the dataloader, so that the dataloader of each torchrun-launched process only loads its own shard of the data on demand (so in total still 512 GB); a sketch of this setup is below.
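Roughly, the workaround looks like this (the dataset is a placeholder, and I use DistributedSampler with shuffle=False as one way to express the per-rank sharding):

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical stand-in for the full 512 GB dataset.
full_dataset = TensorDataset(torch.randn(10_000, 1024))

# torchrun sets RANK / WORLD_SIZE for each launched process.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Each rank only ever touches its own 1/world_size shard,
# so only that shard is loaded (and pinned) on demand.
sampler = DistributedSampler(full_dataset, num_replicas=world_size,
                             rank=rank, shuffle=False)

loader = DataLoader(full_dataset, batch_size=256, sampler=sampler,
                    num_workers=4, pin_memory=True)
```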

The question is: is there a way to preload the dataset into pinned memory once and share that pinned memory among all the torchrun-launched processes?

Thank you!

I can imagine this being a popular use case for the dataloader. I am quite new to the dataloader internals. In principle, each process has its own memory space, so sharing pinned memory between processes is hard.

The closest solution might be share_memory_() in the Multiprocessing package - torch.multiprocessing — PyTorch 2.5 documentation
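A minimal sketch of that idea, assuming torch.multiprocessing.spawn instead of torchrun and a placeholder tensor size. Note the caveat: a tensor placed in shared memory via share_memory_() is visible to all processes as a single copy, but it is not automatically page-locked (pinned), so it may not by itself give the pinned-memory transfer speed described above.

```python
import torch
import torch.multiprocessing as mp

def worker(rank, shared_data):
    # Every spawned process sees the same underlying storage,
    # so there is only one copy of the data in host memory.
    device = torch.device(f"cuda:{rank}")
    for start in range(0, shared_data.size(0), 256):
        batch = shared_data[start:start + 256]
        # Caveat: the shared storage is not pinned, so non_blocking=True
        # does not enable truly asynchronous host-to-device copies here.
        batch = batch.to(device, non_blocking=True)
        # ... training step here ...

if __name__ == "__main__":
    # Hypothetical stand-in for the preloaded dataset.
    data = torch.randn(10_000, 1024)
    data.share_memory_()          # move the storage into shared memory
    mp.spawn(worker, args=(data,), nprocs=4)
```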