What data does each worker process hold? Does it hold the full dataset object or only a batch of it?

Hi there.

The torch.utils.data documentation says:

> When num_workers > 0, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers.

However, in this other topic, @ptrblck says:

> If you use multiple workers in your DataLoader, each worker will load a batch in the background using multiprocessing while the GPU is busy.

My question is: what data does each worker process hold? Does it hold the full dataset object or only a batch of it?

Each worker will hold its own copy of the Dataset object.
If you are using a lazy loading approach in the Dataset, this copy will be small, since only the Dataset object itself and everything created in its __init__ will be copied to each worker, which is usually just a reference to a transformation, the image paths, etc. (see the sketch below).
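As a minimal sketch of what "lazy loading" means here (the class name, PIL usage, and file paths are my own illustration, not from this thread): the per-worker copy contains only the path list and a transform reference, while the actual image bytes are read inside __getitem__ in the worker process.

```python
from PIL import Image
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    """Lazy loading: __init__ stores only lightweight metadata."""

    def __init__(self, image_paths, transform=None):
        # Only this small state (a list of strings and a transform
        # reference) is duplicated into each worker process.
        self.image_paths = image_paths
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # The heavy work (disk I/O, decoding) happens lazily,
        # inside whichever worker handles this index.
        img = Image.open(self.image_paths[idx]).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img
```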

However, if you are eagerly loading the data, i.e. pre-loading the entire dataset in Dataset.__init__, then each worker will also create a copy of it and will thus increase the overall memory usage significantly.
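For contrast, an eager-loading sketch (sizes are made up for illustration): the full tensor lives in the Dataset object, so with num_workers=4 you would end up with four extra copies of it in memory.

```python
import torch
from torch.utils.data import Dataset

class EagerDataset(Dataset):
    """Eager loading: the whole dataset is materialized in __init__."""

    def __init__(self, num_samples=10_000, feature_dim=128):
        # This tensor is part of the Dataset object, so every
        # worker process will hold its own copy of it.
        self.data = torch.randn(num_samples, feature_dim)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # No I/O here; samples are just sliced out of the
        # pre-loaded tensor.
        return self.data[idx]
```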

In both approaches, the workers will then create a batch by calling Dataset.__getitem__ with the indices provided by the sampler.
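To make the whole picture concrete, here is a self-contained sketch (my own example, not from the thread) that spawns 4 workers; the worker_init_fn uses torch.utils.data.get_worker_info() to show that each worker sees its own copy of the dataset, which is also where you could configure each copy independently, as the docs suggest.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, get_worker_info

def worker_init_fn(worker_id):
    # Runs once inside each worker process. get_worker_info() exposes
    # this worker's own copy of the dataset, which can be configured
    # here (e.g. assigning each worker a distinct data shard).
    info = get_worker_info()
    print(f"worker {worker_id} holds a dataset copy with {len(info.dataset)} samples")

if __name__ == "__main__":  # guard required for spawn-based multiprocessing
    dataset = TensorDataset(torch.randn(1000, 128))
    loader = DataLoader(dataset, batch_size=32, num_workers=4,
                        shuffle=True, worker_init_fn=worker_init_fn)
    for (batch,) in loader:
        pass  # each batch was assembled by a worker via Dataset.__getitem__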
