I have a very big file list, and I create a dataset that reads this file list into memory.
Since my training code runs with DistributedDataParallel on 8 GPUs, the dataset is created 8 times:
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" train.py
Together, these processes consume a large amount of memory, nearly 30 GB × 8 = 240 GB in total. Is there a way to let those processes share a single dataset?
Thanks for your help
Can you use a PyTorch DataLoader? If you implement the __getitem__ function, batches will be lazily read into memory. Each DDP replica will then have its own DataLoader, and each DataLoader will load the data lazily, so there shouldn't be as much memory pressure.
Relevant Forums Post: How to use dataset larger than memory?
Thanks for your reply.
I do use a PyTorch DataLoader, as shown below:
import os

import numpy as np
import torch
from PIL import Image

class MyTrainImageFolder(torch.utils.data.Dataset):  # Class inheritance
    def __init__(self, root, file_dir, label_dir, transform=MYTFS):
        img_list = np.load(file_dir)
        labels_list = np.load(label_dir)
        self.imgs = img_list
        self.labels = labels_list
        self.root = root
        self.transform = transform
        self.lens = len(img_list)
        print('total img number:', len(self.imgs))

    def __getitem__(self, index):
        label = torch.tensor(int(self.labels[index]))
        img = Image.open(os.path.join(self.root, self.imgs[index]))
        img = self.transform(img)
        return img, label

    def __len__(self):
        return self.lens
The problem is that the file in "file_dir" is about 30 GB, and I have more than 200 million images. Can I build the dataset just once when using DistributedDataParallel?
Maybe I should split the big list file into smaller patches and use a for loop in my training process? That is the best way I could find.
One thing I can think of is to split the file into smaller patches, and instead of loading the files in the __init__ function, you can load these smaller files in the __getitem__ function itself (using the index and the number of examples per file to fetch the correct file). This way you avoid loading the massive file all at once from all the ranks. I haven't profiled this performance-wise, though you will be doing two disk reads instead of one in __getitem__: one for the images and one for the list/labels file. However, you might benefit from some caching with the latter, depending on the file size, batch size, and how you sample from the dataset.
I’m not sure if there some shared-memory based approach where we can load these files into some memory that is shared by all the processes, but I can try to dig more into this approach if the above one does not work.
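One shared-memory-style option that may be worth trying, assuming the .npy files hold plain fixed-width arrays (not pickled object arrays of Python strings): `np.load` with `mmap_mode='r'` memory-maps the file instead of copying it into each process, so the 8 DDP processes share the OS page cache rather than each holding a private 30 GB copy. A minimal sketch:

```python
import numpy as np

def load_array_mmap(path):
    # Memory-map the .npy file: nothing is copied into RAM up front,
    # pages are faulted in on demand as elements are accessed, and the
    # OS page cache is shared by every process mapping the same file.
    return np.load(path, mmap_mode='r')

# Usage (path is a placeholder):
# labels = load_array_mmap('labels.npy')
# label = int(labels[index])  # reads only the touched page from disk
```

Note this does not work for arrays saved with `allow_pickle=True` (e.g. arrays of variable-length Python strings); those would first need to be converted to a fixed-width dtype or another mappable format.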