How to share CPU memory in distributed training?

Hello, I am trying to train ImageNet on an 8-GPU machine in DDP mode. However, my machine is slow at reading large numbers of small files, so I have to pack the whole dataset into a tar file (130 GB), read the tar file into memory, and extract it in memory. The machine has 360 GB of CPU memory, so this works fine with DataParallel. But it seems I cannot use DistributedDataParallel, since I would need to load the dataset 8 times, once per process. Is there any way to train the model in DDP mode?

Thanks!

One option is to use torch.multiprocessing.Queue as the shared memory. The main process can prepare one queue per DDP process and pass it to that process. The main process reads from the tar file and dispatches data items to the queues, while each DDP process waits on its own queue for its next item.
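
A rough sketch of that pattern with torch.multiprocessing (the tar path and the per-sample format are just placeholders, and the usual DDP setup inside the worker is elided):

```python
import io
import tarfile
import torch.multiprocessing as mp

def read_and_extract_tar(path):
    # Read the whole tar into CPU memory once, then yield (name, raw bytes)
    # samples from the in-memory buffer.
    with open(path, "rb") as f:
        buf = io.BytesIO(f.read())
    with tarfile.open(fileobj=buf) as tar:
        for member in tar:
            fileobj = tar.extractfile(member)
            if fileobj is not None:
                yield member.name, fileobj.read()

def worker(rank, world_size, queue):
    # Usual DDP setup (init_process_group, wrapping the model in DDP) goes here.
    while True:
        item = queue.get()
        if item is None:               # sentinel: main process has no more data
            break
        # ... decode `item`, build a batch, run forward/backward ...

def main():
    world_size = 8
    ctx = mp.get_context("spawn")
    # One bounded queue per DDP process, so the main process cannot run far
    # ahead of the consumers and exhaust CPU memory.
    queues = [ctx.Queue(maxsize=128) for _ in range(world_size)]
    procs = [ctx.Process(target=worker, args=(r, world_size, queues[r]))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for i, item in enumerate(read_and_extract_tar("imagenet.tar")):
        queues[i % world_size].put(item)   # round-robin dispatch
    for q in queues:
        q.put(None)                        # tell every worker to stop
    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
```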

Another option is to split the tar file into multiple smaller pieces and let each DDP process read a different one.
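
If it helps, here is a minimal sketch of the splitting step using the standard tarfile module (the output file names and the round-robin assignment are just one possible choice):

```python
import tarfile

def split_tar(src_path, world_size=8):
    # Stream the members of the big tar round-robin into world_size smaller
    # tars, e.g. data_1.tar ... data_8.tar, one per DDP process.
    outs = [tarfile.open(f"data_{r + 1}.tar", "w") for r in range(world_size)]
    with tarfile.open(src_path, "r") as src:
        for i, member in enumerate(src):
            fileobj = src.extractfile(member)
            if fileobj is None:        # skip directories, links, etc.
                continue
            outs[i % world_size].addfile(member, fileobj)
    for out in outs:
        out.close()

split_tar("imagenet.tar", world_size=8)   # placeholder path
```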


Hello! Thanks for your answer! I have one more question about the second option. Does that mean something like this?

I split the tar file into data_1.tar, data_2.tar, …, data_8.tar. For the k-th GPU, i.e., local_rank = k, the process reads data_k.tar and builds its data loader from data_k.tar. Then I have 8 different data loaders, each with different data. In this case, I guess I should set shuffle=True and do not need a DistributedSampler?
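
Something like this, for example (rough sketch of what I have in mind; label_from_name stands in for my own path-to-label mapping, and local_rank comes from the launcher):

```python
import io
import tarfile
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class InMemoryTarDataset(Dataset):
    """Load one shard (data_k.tar) fully into CPU memory; decode on access."""
    def __init__(self, tar_path, transform=None):
        self.samples = []
        with tarfile.open(tar_path, "r") as tar:
            for member in tar:
                fileobj = tar.extractfile(member)
                if fileobj is not None:
                    self.samples.append((member.name, fileobj.read()))
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name, raw = self.samples[idx]
        image = Image.open(io.BytesIO(raw)).convert("RGB")
        label = label_from_name(name)   # placeholder; must map classes
                                        # consistently across all ranks
        if self.transform is not None:
            image = self.transform(image)
        return image, label

# local_rank is provided by the launcher (torch.distributed.launch / torchrun).
dataset = InMemoryTarDataset(f"data_{local_rank + 1}.tar")
loader = DataLoader(dataset, batch_size=256, shuffle=True,   # no DistributedSampler
                    num_workers=4, pin_memory=True)
```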


Yes, but one caveat is that those data splits need to produce the same number of input batches on every rank. If, say, rank 0 processes 3 batches and rank 1 processes 4, rank 1 would hang on its last batch, waiting for a gradient synchronization that rank 0 never joins.
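
One simple way to enforce that is to have all ranks agree on the minimum batch count before the loop, e.g. (sketch; assumes the default process group is already initialized, and with the NCCL backend the tensor must live on that rank's GPU):

```python
import torch
import torch.distributed as dist

def num_shared_batches(loader, device):
    # Each rank counts its own batches, then all ranks take the minimum,
    # so nobody blocks in the gradient allreduce waiting for a batch that
    # the other ranks never reach.
    n = torch.tensor(len(loader), dtype=torch.long, device=device)
    dist.all_reduce(n, op=dist.ReduceOp.MIN)
    return int(n.item())

# Usage inside the training loop:
# n_batches = num_shared_batches(loader, device)
# for _, (images, labels) in zip(range(n_batches), loader):
#     ...
```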
