How to share data among DataLoader processes to save memory

PyTorch’s DataLoader uses Python multiprocessing, and each worker process gets its own replica of the dataset. When the dataset is huge, this replication leads to out-of-memory issues.

Normally, processes (unlike threads) need shared memory to share data. Is there an easy way to share common data across all of the data-loading worker processes in PyTorch? Maybe someone has already implemented this (I could not find anything yet).


If you are lazily loading the data (which is the common approach when dealing with large datasets), the memory overhead from the copies might be small in comparison to the overall memory usage of the script.
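For illustration, a minimal lazy-loading dataset might look like the sketch below. The file path and on-disk layout (one sample per row of a `.npy` file) are hypothetical; the point is that `mmap_mode="r"` keeps the array on disk, so each worker only pages in the samples it actually reads instead of holding a full copy:

```python
import numpy as np


class LazyDataset:
    """Map-style dataset that memory-maps a .npy file instead of loading it.

    Hypothetical example: one sample per row. Any object with __len__ and
    __getitem__ works as a map-style dataset for torch.utils.data.DataLoader.
    """

    def __init__(self, path):
        self.path = path
        self.data = None  # opened lazily, once per worker process
        self.length = len(np.load(path, mmap_mode="r"))

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.data is None:
            # mmap_mode="r" keeps the array on disk; the OS pages in
            # only the parts each worker actually reads.
            self.data = np.load(self.path, mmap_mode="r")
        return np.array(self.data[idx])  # copy out just this one sample
```

Opening the memmap inside `__getitem__` (rather than in `__init__`) means each worker opens its own file handle after forking, which avoids sharing a single handle across processes.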
That being said, you could try to use shared arrays as described here instead.
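A rough sketch of the shared-array idea, assuming the fork start method (the default on Linux) and a fixed float32 layout; the helper name is my own. The buffer is allocated once in shared memory in the parent, wrapped in a NumPy view, and inherited by the workers instead of being pickled into each one:

```python
import ctypes
import multiprocessing as mp

import numpy as np


def make_shared_array(shape):
    """Allocate a float32 array backed by shared memory (hypothetical helper)."""
    n = int(np.prod(shape))
    # lock=False: the data is written once before the workers start and is
    # read-only afterwards, so no synchronization is needed.
    base = mp.Array(ctypes.c_float, n, lock=False)
    # NumPy view over the shared buffer -- no copy is made here.
    return np.frombuffer(base, dtype=np.float32).reshape(shape)


# Fill the array once in the parent process. With the fork start method,
# DataLoader workers inherit this buffer rather than replicating it, so
# all workers read the same physical memory.
shared = make_shared_array((4, 3))
shared[:] = np.arange(12).reshape(4, 3)
```

A `Dataset` can then keep a reference to `shared` and index into it in `__getitem__`. Note that with the spawn start method (the default on Windows and macOS) the buffer is not inherited automatically and must be handed to the workers explicitly.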