I have a very big file list, organized as:
[
[filename,label],
[filename,label],
…
[filename,label]
]
I create a dataset that reads this file list into memory. Since my training code runs with DistributedDataParallel and I have 8 GPUs, the dataset is created 8 times.
Can you use a PyTorch DataLoader? If you implement the __getitem__ function, batches will be read into memory lazily. Each DDP replica then has its own DataLoader, and each DataLoader loads data lazily, so there shouldn't be as much memory pressure.
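A minimal sketch of what that could look like. The class name ImageListDataset is illustrative, and the actual image decode (e.g. torchvision.io.read_image) is stubbed out so the example stays self-contained; only the (filename, label) list lives in memory, while the expensive read happens per item:

```python
class ImageListDataset:
    """Duck-typed torch.utils.data.Dataset: defines __len__/__getitem__."""

    def __init__(self, file_list):
        # file_list is the [[filename, label], ...] structure from the post.
        # Only this lightweight list is held in memory.
        self.file_list = file_list

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, index):
        filename, label = self.file_list[index]
        # In real code this would decode the image, e.g.:
        #   image = torchvision.io.read_image(filename)
        # Stubbed here so the sketch runs without image files on disk.
        image = f"decoded({filename})"
        return image, label


data = [["img_0.jpg", 0], ["img_1.jpg", 1], ["img_2.jpg", 0]]
ds = ImageListDataset(data)
print(len(ds))    # 3
print(ds[1])      # ('decoded(img_1.jpg)', 1)
```

Wrapping this in a DataLoader (with num_workers > 0 and a DistributedSampler per rank) then fetches items on demand instead of up front.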
The problem is that the file in “file_dir” is about 30 GB; I have more than 200 million images. Can I build the dataset only once when using DistributedDataParallel?
One thing I can think of is to split the file into smaller shards and, instead of loading everything in the __init__ function, load these smaller files in the __getitem__ function itself (using the index and the number of examples per file to find the correct shard). This way you avoid loading the massive file all at once from all the ranks. I haven't profiled this, though: you will be doing two disk reads per __getitem__ call instead of one, one for the image and one for the list/labels shard. However, you might benefit from some caching on the latter, depending on the shard size, batch size, and how you sample from the dataset.
I’m not sure if there is some shared-memory-based approach where we could load these files into memory shared by all the processes, but I can try to dig more into that if the above one does not work.
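One stdlib option along those lines is to memory-map a read-only file: the mapped pages live in the OS page cache, so all 8 ranks share a single physical copy instead of each holding its own 30 GB list. A sketch under assumptions (a fixed-width binary record layout and the filelist.bin name are invented for illustration; in a real DDP job only rank 0 would write the file, behind a barrier):

```python
import mmap
import os
import tempfile

RECORD = 32  # bytes per record: filename padded to 31 bytes + 1 label byte

# --- Setup: write the file list once in a fixed-width binary layout.
# Inline here so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "filelist.bin")
entries = [("img_0.jpg", 0), ("img_1.jpg", 1)]
with open(path, "wb") as f:
    for name, label in entries:
        f.write(name.encode().ljust(RECORD - 1, b"\x00") + bytes([label]))

# --- Each process maps the file read-only; the OS shares the backing
# pages across processes, so the memory cost is paid once, not 8 times.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def get_entry(index):
    rec = mm[index * RECORD:(index + 1) * RECORD]
    name = rec[:RECORD - 1].rstrip(b"\x00").decode()
    return name, rec[RECORD - 1]

print(get_entry(1))   # ('img_1.jpg', 1)
```

The fixed-width layout is what makes random access O(1); a numpy array saved once and reopened with np.load(..., mmap_mode="r") would achieve the same page sharing with less manual byte handling.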