How to cache an entire dataset in multiprocessing?

Hi,

I would like to train a model on ImageNet, but with the default ImageFolder the training is taking too long.

To speed up training, I want to cache the dataset in RAM. I’ve seen that one way to do this is to create a large numpy array or a list of tensors, as is done for the small datasets (CIFAR, MNIST) in the torchvision.datasets module.
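To illustrate, here is a minimal sketch of the kind of cache I have in mind (the CachedDataset wrapper is just my own illustration, not something from torchvision):

```python
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wraps a dataset and keeps every loaded sample in RAM after first access."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.cache = {}  # index -> (image_tensor, label)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        if index not in self.cache:
            self.cache[index] = self.dataset[index]
        return self.cache[index]
```

The catch is that with DataLoader workers or DDP, each process would fill its own copy of this cache, which leads to my question below.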

However, I was wondering how to do that with multiprocessing and distributed data-parallel: as I understand it, each process will create its own dataset object, so I risk duplicating the RAM usage for every process started. Is there a way to cache the list of tensors only once and then share it with each process?

Assuming you are using a DistributedSampler in your DistributedDataParallel use case, you could try to cache only the subset seen by each process.
However, note that ImageNet is a large dataset, so you would need a lot of host memory to store it. I’m not familiar with your exact approach, but if you create caches of already-transformed samples and reuse them directly, data augmentation would effectively be disabled (unless you transform the samples again), which might hurt your training performance.
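Something along these lines, as an untested sketch (the dataset path is a placeholder, and shuffle=False keeps the per-rank subsets fixed):

```python
import torchvision
from torch.utils.data import DistributedSampler

# assumes the default process group was already initialized via init_process_group
dataset = torchvision.datasets.ImageFolder("/path/to/imagenet/train")
sampler = DistributedSampler(dataset, shuffle=False)

# with shuffle=False every epoch yields the same indices for this rank, so each
# process only needs to hold roughly len(dataset) / world_size samples in RAM
local_cache = {idx: dataset[idx] for idx in sampler}
```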

Thank you for your reply. I am indeed using these tools; however, I think caching the subsets in each process won’t work for subsequent epochs, because the DistributedSampler will shuffle the indices across the replicas.

I have enough RAM to store ImageNet, but I agree that this is a specific use case for sufficiently large clusters. I would like to store the dataset before the augmentations are applied. Maybe I should try to pass the dataset from the main process to all the spawned processes instead of loading it in each process. I’m unfamiliar with the multiprocessing library, but I would guess there is a way to share the data before or after the processes are spawned, as in the sketch below.
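I’m imagining something like this untested sketch, where load_samples is a hypothetical helper standing in for the real ImageNet loading:

```python
import torch
import torch.multiprocessing as mp

def load_samples():
    # hypothetical loader: decode the dataset once into (tensor, label) pairs
    return [(torch.rand(3, 224, 224), 0) for _ in range(10)]

def worker(rank, samples):
    # every rank indexes the same shared-memory storages instead of its own copy
    img, label = samples[0]
    print(rank, img.shape, label)

if __name__ == "__main__":
    # move each tensor's storage to shared memory before spawning the workers
    samples = [(img.share_memory_(), label) for img, label in load_samples()]
    mp.spawn(worker, args=(samples,), nprocs=2)
```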

I’ve seen an answer you provided a few years ago: How to share data among DataLoader processes to save memory. I think what I’m looking for is close to the solution you provided, except that a single numpy array requires all images to have the same size, which is not the case for raw ImageNet. So I need a way to share a list of differently-sized tensors with all the processes; one packing idea is sketched below.
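One workaround I can think of (again an untested sketch, with a hypothetical dataset path) is to keep the images encoded and pack the variably-sized byte blobs into a single flat shared buffer plus an offset table:

```python
import io
from pathlib import Path

import numpy as np
import torch
from PIL import Image

# hypothetical root; keeping the JPEGs encoded needs far less RAM than decoded tensors
paths = sorted(Path("/path/to/imagenet/train").rglob("*.JPEG"))
blobs = [p.read_bytes() for p in paths]

# one flat uint8 tensor plus an offset table, both moved to shared memory,
# so every process maps the same pages instead of copying them
flat = torch.from_numpy(
    np.frombuffer(b"".join(blobs), dtype=np.uint8).copy()
).share_memory_()
offsets = torch.from_numpy(np.cumsum([0] + [len(b) for b in blobs])).share_memory_()

def decode(i):
    # slice out sample i and decode on the fly, so augmentations still apply
    data = flat[offsets[i]:offsets[i + 1]].numpy().tobytes()
    return Image.open(io.BytesIO(data)).convert("RGB")
```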

Yes, I think you are right regarding the shuffling in DistributedSampler.
If you want to share arrays with different shapes, you might want to check the shared_dict implementation and see if it would be a valid approach for you.
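If that implementation doesn’t fit, a multiprocessing.Manager dict is a rough stand-in for the same idea, at the cost of IPC on every access (a minimal sketch, not the shared_dict code itself):

```python
import torch
import torch.multiprocessing as mp

def worker(rank, shared_cache):
    # all ranks talk to the same manager process, so they see a single dict
    if rank == 0:
        shared_cache[0] = torch.zeros(3, 224, 224)

if __name__ == "__main__":
    manager = mp.Manager()
    shared_cache = manager.dict()
    mp.spawn(worker, args=(shared_cache,), nprocs=2)
```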


Thank you, I will definitely look at this and mark it as the solution, since I should be able to do what I want with this object.

This thread has some links that capture the solutions well. Note that the DistributedSampler code currently still creates Python lists, whose elements are effectively copied on read (each access updates per-object refcounts, which breaks copy-on-write after fork) and can cause large memory usage.
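A common mitigation for that copy-on-read behavior is to keep per-sample metadata in numpy arrays rather than Python lists, since reading a list element updates the element's refcount and forces the forked page to be copied:

```python
import numpy as np

# Python list of ints: every read touches per-object refcounts, so memory pages
# shared with forked DataLoader workers get copied over time
labels_list = [i % 1000 for i in range(1_281_167)]

# numpy array: reads are plain memory loads, so the pages stay shared
# between the parent and its forked workers
labels = np.asarray(labels_list, dtype=np.int64)
```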