How to cache an entire dataset in multiprocessing?

Thank you for your reply. I am indeed using these tools. However, I think caching the subsets in each process won't work for subsequent epochs, because the DistributedSampler will reshuffle the indices across the replicas.

I have enough RAM to store ImageNet, but I agree that this is a specific use case for large enough clusters. I would like to store the dataset before applying the augmentations. Maybe I should try to pass the dataset from the main process to all the spawned processes instead of loading it in each process. I'm unfamiliar with the multiprocessing library, but I would guess there is a way to share the data either before or after the different processes are spawned.
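Something like this is what I have in mind, just a rough sketch (the shapes and variable names are made up). The idea would be to load the raw images once in the main process and move each tensor's storage into shared memory with `share_memory_()`, so that the DataLoader workers read the same pages instead of each holding their own copy:

```python
import torch

# Hypothetical cache: decoded images kept as uint8 tensors, loaded once
# in the main process (shapes here are just placeholders).
cached_images = [
    torch.randint(0, 256, (3, 300, 400), dtype=torch.uint8),
    torch.randint(0, 256, (3, 512, 384), dtype=torch.uint8),
]

# Move each tensor's storage to shared memory so worker processes that
# inherit (fork) or receive these tensors access the same underlying data
# instead of copying it.
for img in cached_images:
    img.share_memory_()
```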

I’ve seen an answer you provided a few years ago: How to share data among DataLoader processes to save memory. I think what I’m looking for is close to that solution, except that I can’t use a single NumPy array, since that requires images of the same size, which is not the case for raw ImageNet. So I need to find a way to share a list of tensors with all the different processes.
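If that approach makes sense, I imagine wrapping the shared list in a regular Dataset and applying the augmentations on the fly, something like the sketch below (again just an assumption on my side, with made-up names like `SharedListDataset`):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SharedListDataset(Dataset):
    """Sketch of a dataset wrapping a list of variable-size image tensors
    whose storage already lives in shared memory."""

    def __init__(self, images, labels, transform=None):
        self.images = images        # list of uint8 tensors, arbitrary H x W
        self.labels = labels
        self.transform = transform  # augmentations applied per sample

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img, self.labels[idx]

# Assuming cached_images was put in shared memory as above, and labels,
# my_transforms and the DistributedSampler exist elsewhere:
# dataset = SharedListDataset(cached_images, labels, transform=my_transforms)
# loader = DataLoader(dataset, batch_size=64, num_workers=4, sampler=dist_sampler)
```

My understanding (please correct me if I'm wrong) is that only the tensor storages are shared; the Python list itself would still be duplicated per worker, but that should only be the object headers, not the image data.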