How to read file only once in DDP

When I run the training code with DDP, each process reads the entire dataset, so the file is read N times for N processes. When the file is large, I run out of memory (host/CPU memory, not GPU memory).

Are there any better ways to load the dataset file in a memory-efficient way?
(To be clear, I'm not talking about the dataloader class when I say 'load' - I mean the stage before dataloading - reading the dataset file from disk.)

If you are lazily loading samples you would use a `DistributedSampler` (from `torch.utils.data.distributed`) and each rank would thus load only the samples assigned to it.
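A minimal sketch of that idea, assuming a hypothetical `LazyDataset` whose `__getitem__` would read a single sample from disk on demand (here it just returns the index as a tensor). Note that `DistributedSampler` accepts explicit `num_replicas`/`rank` arguments, so the partitioning can be shown without initializing a process group:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

class LazyDataset(Dataset):
    """Hypothetical lazy dataset: nothing is loaded up front."""
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In a real dataset: seek to the offset of sample `idx`
        # in the file and read only that sample.
        return torch.tensor(idx)

dataset = LazyDataset(8)

# Each rank iterates over a disjoint subset of indices, so with lazy
# __getitem__ no rank ever materializes the full file in memory.
for rank in range(2):
    sampler = DistributedSampler(
        dataset, num_replicas=2, rank=rank, shuffle=False
    )
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    print(rank, [batch.tolist() for batch in loader])
```

In an actual DDP script you would omit `num_replicas`/`rank` (they default to the values from the initialized process group) and call `sampler.set_epoch(epoch)` each epoch when `shuffle=True`.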