The torch.utils.data documentation says that when num_workers > 0, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers.
However, in this other topic, @ptrblck says that if you use multiple workers in your DataLoader, each worker will load a batch in the background using multiprocessing while the GPU is busy.
My question is: what data does each worker process hold? Does it hold the full dataset object or only a batch of it?
Each worker will hold its own copy of the Dataset object.
Now, if you are using a lazy loading approach in the Dataset, the actual copy will be small, since basically only the Dataset object itself and everything in its __init__ will be copied to each worker, which is usually just a reference to a transformation, the image paths, etc.
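A minimal sketch of such a lazy-loading Dataset (the class name is illustrative, and a random tensor stands in for actual image decoding, e.g. via PIL):

```python
import torch
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    def __init__(self, image_paths, transform=None):
        # Only lightweight state lives here: the list of paths and a
        # reference to a transform. This is all each worker copies.
        self.image_paths = image_paths
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # The sample is loaded on demand, per index.
        # (Stand-in for e.g. PIL.Image.open(self.image_paths[idx]).)
        sample = torch.randn(3, 8, 8)
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
```

Because __init__ holds only paths and a transform reference, duplicating this object across workers costs very little memory.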
However, if you are eagerly loading the data, i.e. you are pre-loading the entire dataset in Dataset.__init__, then each worker will also create a copy of it and will thus increase the overall memory usage significantly.
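For contrast, a sketch of the eager variant (names and sizes are illustrative): here the full tensor is materialized in __init__, so every worker process gets its own complete copy of self.data.

```python
import torch
from torch.utils.data import Dataset

class EagerDataset(Dataset):
    def __init__(self, num_samples=100):
        # The entire dataset is created up front; with num_workers > 0
        # each worker duplicates this tensor in its own memory.
        self.data = torch.randn(num_samples, 3, 8, 8)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # No loading happens here; just an index into preloaded memory.
        return self.data[idx]
```

With N workers, peak memory is roughly N + 1 copies of self.data (workers plus the main process), which is why eager loading scales poorly with num_workers.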
In both approaches the workers will then create a batch by calling Dataset.__getitem__ using the indices from the sampler.
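This last step can be sketched end to end with a toy Dataset (the class is hypothetical; num_workers=0 is used here for brevity — with num_workers > 0 the same __getitem__ calls would run in worker processes, each holding its own Dataset copy):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        # Called once per index drawn by the sampler; the default
        # collate function then stacks the results into a batch.
        return torch.tensor(idx * idx)

loader = DataLoader(SquaresDataset(), batch_size=4, shuffle=False,
                    num_workers=0)
first_batch = next(iter(loader))
# first_batch is tensor([0, 1, 4, 9])
```

With shuffle=False the sequential sampler yields indices 0..3 for the first batch; a different sampler would simply feed different indices into the same __getitem__ calls.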