Hi there.
The torch.utils.data documentation says that when num_workers > 0, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers.
However, on this other topic, @ptrblck says that if you use multiple workers in your DataLoader, each worker will load a batch in the background using multiprocessing while the GPU is busy.
My question is: what data does each worker process hold? Does it hold the full dataset object or only a batch of it?
Each worker will hold its own copy of the Dataset object.
Now, if you are using a lazy loading approach in the Dataset, that copy will be small, since essentially only the Dataset object itself and everything created in its __init__ will be copied into each worker, which is usually just a reference to a transformation, the list of image paths, etc.
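For illustration, a minimal lazy-loading Dataset might look like the sketch below (the file paths and transform are hypothetical placeholders): only the path list and the transform reference live in __init__, so that is all each worker copies.

```python
from PIL import Image
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    def __init__(self, image_paths, transform=None):
        # Only lightweight attributes are stored here; this is what
        # gets copied into each worker process.
        self.image_paths = image_paths  # list of file path strings
        self.transform = transform      # e.g. a torchvision transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        # The actual sample is loaded on demand, inside the worker.
        image = Image.open(self.image_paths[index]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image
```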
However, if you are eagerly loading the data, i.e. you are pre-loading the entire dataset in Dataset.__init__, then each worker will also create a copy of all of it and will thus increase the overall memory usage significantly.
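By contrast, an eagerly loading Dataset (sketched here with random tensors standing in for real preprocessed data) materializes everything up front, so every worker ends up with a full copy:

```python
import torch
from torch.utils.data import Dataset

class EagerDataset(Dataset):
    def __init__(self, num_samples=1000):
        # The entire dataset is materialized here, so every worker
        # process ends up holding a full copy of these tensors.
        self.data = torch.randn(num_samples, 3, 224, 224)
        self.targets = torch.randint(0, 10, (num_samples,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.targets[index]
```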
In both approaches the workers will then create a batch by calling Dataset.__getitem__ with the indices provided by the sampler.
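Putting it together, a typical multi-worker setup could look like the following (toy data and batch size are placeholders, just to demonstrate the mechanics):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Hypothetical toy data; each of the 4 workers gets its own copy
    # of this dataset object.
    dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
    loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)

    for features, targets in loader:
        # Each batch was assembled in a worker process by calling
        # dataset.__getitem__ with the indices drawn from the sampler.
        pass

if __name__ == "__main__":  # guard required for multiprocessing spawn
    main()
```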