PyTorch’s DataLoader uses Python multiprocessing, and each worker process gets a replica of the dataset. When the dataset is huge, this replication leads to memory issues.
Unlike threads, processes do not share memory by default, so they normally need an explicit shared-memory mechanism to share common data. I wonder if there is an easy way to share the common data across all the data-loading worker processes in PyTorch. Maybe someone has already coded this (I could not find it yet).
If you are lazily loading the data (which is the common use case when dealing with large datasets), the memory overhead from the copies might be small compared to the overall memory usage of the script.
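For example, a lazily loading Dataset only keeps the (small) list of file paths per worker and reads each sample on demand, along the lines of this sketch (the paths and file format are illustrative):

```python
import torch
from torch.utils.data import Dataset

class LazyDataset(Dataset):
    def __init__(self, paths):
        # Only the path list is replicated per worker; it is tiny
        # compared to the actual data.
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The sample is loaded on demand, so the full dataset never
        # resides in memory at once.
        return torch.load(self.paths[idx])
```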
That being said, you could try to use shared arrays as described here instead.
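A minimal sketch of the shared-array idea, assuming the whole dataset fits into a single tensor: `share_memory_()` moves the tensor’s storage into shared memory, so forked worker processes map the same pages instead of duplicating them.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SharedArrayDataset(Dataset):
    def __init__(self, data: torch.Tensor):
        # Move the storage into shared memory; workers created via fork
        # access the same underlying buffer without copying it.
        self.data = data.share_memory_()

    def __len__(self):
        return self.data.size(0)

    def __getitem__(self, idx):
        return self.data[idx]

if __name__ == "__main__":
    data = torch.randn(100_000, 128)  # stand-in for a large dataset
    loader = DataLoader(SharedArrayDataset(data), batch_size=512, num_workers=4)
    for batch in loader:
        pass  # training step would go here
```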
I am facing a problem similar to the one mentioned here, but in my case I want to share a class object (a tree-structure object, a “very large tree”) between workers. I see that Python multiprocessing only supports sharing arrays out of the box. Is there a way to share other kinds of objects between workers?
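One option, sketched below (a hedged sketch, not a confirmed recommendation): multiprocessing’s `BaseManager` can proxy arbitrary Python objects, not just arrays. The tree lives in a single manager process and the workers hold proxies, so nothing is copied, at the cost of an IPC round trip per method call. The `Tree` class here is a placeholder.

```python
from multiprocessing.managers import BaseManager

class Tree:
    """Placeholder for the very large tree structure."""
    def __init__(self):
        self.nodes = {}

    def lookup(self, key):
        return self.nodes.get(key)

class TreeManager(BaseManager):
    pass

# Register the class so the manager process can host instances of it.
TreeManager.register("Tree", Tree)

if __name__ == "__main__":
    manager = TreeManager()
    manager.start()
    shared_tree = manager.Tree()  # a proxy; safe to hand to DataLoader workers
    # Every worker holding the proxy queries the same hosted object:
    print(shared_tree.lookup("root"))
```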
Hi @ptrblck, I know this thread might be dated, but I wanted to second @Pietro_Cicalese’s observation about the proposed approach (the very last paragraph).
I also observed significant overhead when using the built-in Queue approach for multiprocessing data loading, predominantly coming from the fact that ConnectionWrapper unpickles the received byte array here. I see that Connection requires recv to return something pickleable, but a byte array is also pickleable. Or is it just an intermediary containing the fd handle/size? Also, multiple connections are established between the processes, each of which requires passing answer_challenge.
That, together with the fact that batches made of multiple smaller tensors seem to exacerbate the issue during transfer, is why I wanted to ask (in case you know):
What’s the recommended way to share larger sample batches made of multiple tensors using the built-in tools? Or is the only option to build a custom shared-memory-based file-sharing solution in a single-producer, single-consumer style?
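For concreteness, here is a sketch of the kind of workaround I have in mind (the collate function and names are illustrative): packing each batch into one contiguous tensor per field in a custom collate_fn, so that a single storage crosses the process boundary instead of many small ones.

```python
import torch

def packed_collate(samples):
    # `samples` is a list of (feature, label) pairs produced by the Dataset.
    # Stacking yields one contiguous tensor per field, so the worker queue
    # transfers two storages instead of 2 * batch_size of them.
    features = torch.stack([s[0] for s in samples])
    labels = torch.tensor([s[1] for s in samples])
    return features, labels

# usage:
# DataLoader(dataset, batch_size=512, num_workers=4, collate_fn=packed_collate)
```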
Here’s an excerpt from a profiling session over 50 steps (the same number of batches in this case), with each batch containing 512 samples: