Multi-process data loading and prefetching

claudiacorreia60 · October 11, 2020, 4:55pm

From what I understand the worker processes of the Dataloader fetch batches instead of fetching samples. Is there a way of fetching samples instead of batches?

Also, when setting num_workers > 0, by default each worker prefetches 2 samples in advance. I don’t understand exactly how this works. Each worker prepares a batch and reads 2 samples in advance for the next batch? Do the workers prefetch samples for the next training epoch before it starts?

albanD · October 11, 2020, 5:05pm

Hi,

The DataLoader fetches whole batches so that all the preprocessing and the creation of the batch happen in the worker process, leaving as little as possible for the main process to do once the batch is ready.
Why would you want workers to load samples only?

Each worker prefetches 2 batches in advance to make sure that when the main process asks for the next batch, there is always one ready.
Note that if you use a nightly build, you can control that number with the prefetch_factor argument to the DataLoader (doc here: https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader)
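
To make the numbers concrete, here is a small back-of-the-envelope sketch in plain Python (no PyTorch calls). It assumes what the DataLoader documentation describes for prefetch_factor: each worker loads that many batches in advance, so roughly prefetch_factor * num_workers batches are in flight in total. The helper name batches_in_flight is made up for illustration.

```python
def batches_in_flight(num_workers: int, prefetch_factor: int = 2) -> int:
    """Approximate number of batches loaded ahead of the main process."""
    if num_workers == 0:
        # Without worker processes, loading happens in the main process
        # and nothing is prefetched.
        return 0
    return num_workers * prefetch_factor


num_workers, batch_size = 4, 64
ahead = batches_in_flight(num_workers)
print(ahead)               # 8 batches
print(ahead * batch_size)  # 512 samples held in memory ahead of training
```

This is only a rough estimate, but it is a useful one when the samples are large: raising num_workers or prefetch_factor also raises the amount of data held in memory.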

claudiacorreia60 · October 11, 2020, 7:21pm

Hi,

Thank you very much for your answer!

Ok, I now understand why the workers fetch batches instead of samples.

Do the workers fetch batches for the next epoch before it starts, or do the batches of an epoch only start being fetched when the epoch starts?

On another subject, I noticed that when I choose batch_size=64, worker 1 reads the first 64 indexes, worker 2 reads the next 64 indexes, and so on. Is there a way to have the workers read interleaved indexes?
For example, when num_workers = 3:

worker 1 reads indexes 1, 4, 7, 10…
worker 2 reads indexes 2, 5, 8, 11…
worker 3 reads indexes 3, 6, 9, 12…

Thank you!

albanD · October 12, 2020, 1:32pm

It only starts with the epoch, when the iterator is created. Meaning, when you do:

for sample in dataloader:

You can specify a sampler (doc) when you create the DataLoader; the sampler is responsible for drawing the samples.
You can write your own sampler to force a certain pattern in the content of each batch (and thus in what the workers will load).
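
As a rough sketch of that idea, here is a batch sampler written in plain Python that produces the interleaved pattern asked about above. It relies on an assumption that matches the behaviour observed in this thread but is an implementation detail rather than a documented guarantee: batches are handed to workers in round-robin order (batch 0 to the first worker, batch 1 to the second, and so on). The class name InterleavedBatchSampler is hypothetical, and indexes are 0-based here.

```python
from typing import Iterator, List


class InterleavedBatchSampler:
    """Yield batches so that batch k only holds indexes i with
    i % num_workers == k % num_workers."""

    def __init__(self, dataset_len: int, batch_size: int, num_workers: int) -> None:
        self.dataset_len = dataset_len
        self.batch_size = batch_size
        self.num_workers = num_workers

    def _batches(self) -> List[List[int]]:
        n, bs = self.num_workers, self.batch_size
        # One strided index stream per worker: w, w + n, w + 2n, ...
        streams = [list(range(w, self.dataset_len, n)) for w in range(n)]
        # Cut each stream into batches of at most batch_size indexes.
        per_worker = [[s[i:i + bs] for i in range(0, len(s), bs)] for s in streams]
        # Emit the batches round-robin across workers (assumed assignment order).
        batches: List[List[int]] = []
        for round_idx in range(max(len(p) for p in per_worker)):
            for p in per_worker:
                if round_idx < len(p):
                    batches.append(p[round_idx])
        return batches

    def __iter__(self) -> Iterator[List[int]]:
        return iter(self._batches())

    def __len__(self) -> int:
        return len(self._batches())


sampler = InterleavedBatchSampler(dataset_len=12, batch_size=2, num_workers=3)
print(list(sampler))
# [[0, 3], [1, 4], [2, 5], [6, 9], [7, 10], [8, 11]]
```

An instance like this could then be passed to the DataLoader through its batch_sampler argument, with num_workers set to the same value. When batch_sampler is given, batch_size, shuffle, sampler, and drop_last have to be left at their defaults.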

claudiacorreia60 · October 12, 2020, 10:50pm

Ok, I will try that.

Thank you very much for your help!