DataLoader relationship between num_workers, prefetch_factor and the type of Dataset

I’m slowly wrapping my head around the proper use of a custom Dataset as used by the DataLoader.

I have a torch.utils.data.Dataset which implements:

def __getitem__(self, idx):
    ...
    return {"image": X_ims, "label": Y}
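
For context, a minimal self-contained sketch of that kind of Dataset could look like the following (the class name, the random tensor standing in for the image, and the label handling are just illustrative placeholders, not my actual code):

import torch
from torch.utils.data import Dataset

class DictImageDataset(Dataset):
    # Hypothetical stand-in for the real Dataset described above.
    def __init__(self, labels):
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # The real implementation would load and preprocess an image from disk;
        # a random tensor takes its place here.
        X_ims = torch.randn(3, 224, 224)
        Y = self.labels[idx]
        return {"image": X_ims, "label": Y}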

When I supply the DataLoader with a batch_size I get the expected number of samples back, but the documentation is a bit unclear about what the workers are doing. It mentions that they fetch samples from the Dataset, and I am “assuming” that prefetch_factor is the number of samples they retrieve. Do they keep doing this until another entire batch is loaded (ideally while my model is training), or is each worker trying to collect a full batch of samples on its own?

The question arises because I often find myself stuck waiting in DataLoader’s
_next_data() > _get_data() > _try_get_data() > _get() > _wait() chain, which I take to mean that it is waiting for the workers to come back with all the samples.

If I have a large batch size (1000) and num_workers=2, should I set prefetch_factor=500, in the hope that each worker can fetch enough samples for the next batch by the time I’m ready for it?

Each worker will load a complete batch in the background (if num_workers > 0) while training runs. There is a feature request to allow multiple workers to work on a single batch, but I don’t think it has been implemented yet.

If I’m not mistaken, prefetch_factor defines the number of batches that are preloaded, so 500 would be quite large (it could be alright if you have enough memory).
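
As a rough illustration (the dataset and the numbers are placeholders, reusing the hypothetical DictImageDataset sketched above), something like this keeps only a few items in flight per worker instead of hundreds:

from torch.utils.data import DataLoader

# Hypothetical dataset instance; labels are kept in 0..9 purely for illustration.
dataset = DictImageDataset(labels=[i % 10 for i in range(10_000)])

loader = DataLoader(
    dataset,
    batch_size=1000,
    num_workers=2,
    prefetch_factor=2,  # per-worker prefetch, so 2 * num_workers items are held ready
    pin_memory=True,    # speeds up the later host-to-device copies
    shuffle=True,
)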

I see that the documentation for the recent version rather describes it as loading prefetch_factor * num_workers “samples”:

prefetch_factor (int, optional, keyword-only arg): Number of samples loaded
in advance by each worker. 2 means there will be a total of
2 * num_workers samples prefetched across all workers. (default: 2)

However, using different prefetch_factor values did not change the GPU memory usage of my pipeline at all. I’m not sure whether that is due to my customized DataLoader or to another issue with this newer PyTorch functionality (I’m hoping to spend more time on this soon, but I would appreciate any feedback if someone happens to stop by to look at this).

I wouldn’t expect it to change the GPU memory usage, since the data is preloaded on the host and each batch is pushed to the device in the training loop in the common use case. This also seems to be your workflow, as I cannot see any GPU usage in the MRIDataset.
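
A minimal sketch of that common use case, with a placeholder model, loss, and optimizer added just to make the loop complete (the loader is the one created above): the workers fill batches in host memory, and the push to the device only happens inside the loop:

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model, loss, and optimizer for illustration only.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for batch in loader:
    # The batch arrives in (pinned) CPU memory from the workers;
    # this is where it gets pushed to the device.
    images = batch["image"].to(device, non_blocking=True)
    labels = batch["label"].to(device, non_blocking=True)

    optimizer.zero_grad()
    out = model(images)
    loss = criterion(out, labels)
    loss.backward()
    optimizer.step()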

Thanks much! I was assuming this functionality enables data to be prefetched to the GPU (via the CPU), but from your explanation it looks like it is limited to CPU memory (and that is what I should be monitoring to see performance gains). Although that kind of prefetching would be helpful, I guess the current behavior still results in a squarish waveform for GPU utilization (idle times). I have actually tried calling .cuda() within __getitem__ in the past with little success (I remember the code threw errors, but I might try again).
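
In case it helps anyone reading later: instead of calling .cuda() inside __getitem__, one workaround pattern (sketched below under the assumption of dict-shaped batches, a CUDA device, and pin_memory=True; this is not a built-in DataLoader feature) is to copy the next batch to the GPU on a side CUDA stream while the current one is being used:

import torch

class CUDAPrefetcher:
    # Rough sketch of overlapping host-to-device copies with compute.
    # Assumes each batch is a dict of tensors and the DataLoader uses pin_memory=True.
    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream(device)
        self.next_batch = None
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # Asynchronous copy of the next batch on the side stream.
            self.next_batch = {k: v.to(self.device, non_blocking=True)
                               for k, v in batch.items()}

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # Make the default stream wait until the async copy has finished.
        torch.cuda.current_stream(self.device).wait_stream(self.stream)
        batch = self.next_batch
        self._preload()
        return batch

The training loop would then iterate over CUDAPrefetcher(loader, device) instead of loader; a more careful version would also call record_stream() on the prefetched tensors so their memory is not reused too early.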