Sharing a dataset between subprocesses when using DistributedDataParallel

I have a very big file list, which is organized with:

I create a dataset that reads this file list into memory.
Since my training code runs with DistributedDataParallel on 8 GPUs, the dataset is created 8 times:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=""

Together these copies consume a huge amount of memory, nearly 30 GB × 8 = 240 GB in total. Is there a way to let those processes share a single dataset?

Thanks for your help

Can you use a PyTorch DataLoader? If you implement the __getitem__ function, batches will be read into memory lazily. Each DDP replica will then have its own DataLoader, and each DataLoader will load the data lazily, so there shouldn’t be as much memory pressure.

Relevant Forums Post: How to use dataset larger than memory?
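A minimal sketch of that lazy-loading pattern (the class name and sample shapes here are purely illustrative, not from the thread):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazyDataset(Dataset):
    """Keeps only metadata in __init__; each sample is produced on demand
    in __getitem__, so nothing large is held in memory per process."""
    def __init__(self, num_samples):
        self.num_samples = num_samples  # store only the length up front

    def __getitem__(self, index):
        # In a real dataset you would read one image from disk here.
        x = torch.randn(3, 32, 32)   # stand-in for a decoded image
        y = index % 10               # stand-in for a label
        return x, y

    def __len__(self):
        return self.num_samples

# Each DDP rank would build its own DataLoader over this dataset.
loader = DataLoader(LazyDataset(1000), batch_size=32, num_workers=4)
```

With DistributedDataParallel you would normally also pass a DistributedSampler to the DataLoader so each rank sees a distinct shard of indices.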

Thanks for your reply.
I do use a PyTorch DataLoader, like below:

import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class MyTrainImageFolder(Dataset):  # inherits from torch.utils.data.Dataset
    def __init__(self, root, file_dir, label_dir, transform=MYTFS):
        self.imgs = np.load(file_dir)      # full image path list, loaded eagerly
        self.labels = np.load(label_dir)   # full label list, loaded eagerly
        self.root = root
        self.transform = transform
        self.lens = len(self.imgs)
        print('total img number:', len(self.imgs))

    def __getitem__(self, index):
        label = torch.tensor(int(self.labels[index]))
        img = Image.open(os.path.join(self.root, self.imgs[index]))
        img = self.transform(img)

        return img, label

    def __len__(self):
        return len(self.imgs)

The problem is that the file at “file_dir” is about 30 GB, and I have more than 200 million images. Can I build the dataset only once when using DistributedDataParallel?

Maybe I should split the big list file into smaller pieces and use a for loop in my training process? That is the best way I could find.

One thing I can think of is to split the file into smaller patches and, instead of loading the files in the __init__ function, load these smaller files in the __getitem__ function itself (using the index and the number of examples per file to fetch the correct one). This way you avoid loading the massive file all at once on every rank. I haven’t profiled this performance-wise, though you will be doing two disk reads instead of one in __getitem__: one for the image and one for the list/labels file. However, you might benefit from some caching with the latter, depending on the file size, batch size, and how you sample from the dataset.

I’m not sure if there is some shared-memory-based approach where we could load these files into memory shared by all the processes, but I can try to dig more into this if the above does not work.