PyTorch multi-worker DataLoader: does it run in parallel with the training code?

Hi All

Our DataLoader and training code work like this:

for fi, batch in enumerate(my_data_loader):

In our DataLoader we have defined a collate_fn that does the preprocessing (cook_data).


My question is: while training is running, can the DataLoader run in parallel in the background to do things like cook_data? Or does each “process” first load/cook the data and then run training, so that during training this particular process is basically blocked and waiting?

The DataLoader will use multiprocessing to create multiple workers, which will load and process each data sample and add the batch to a queue. It should thus not be blocking the training as long as the queue is filled with batches.

Thanks @ptrblck. So the training runs in a different process than the multiprocessing DataLoader workers, right? From your description, if training is slower than data loading, then training should run continuously and the loading time will be hidden?

Also, what if I use nn.DataParallel? My understanding is that DataParallel uses multithreading, so how does it work together with the multi-process DataLoader? Is it still the same, i.e. the DataLoader worker processes load the data into a queue, and the training process (a different process) spins up one thread per GPU to train?

Yes, the main process would execute the training loop, while each worker will be spawned in a new process via multiprocessing. nn.DataParallel and the DataLoader do not interfere with each other.
Also yes, if the loading pipeline is faster than the training, the data loading time would be “hidden”.
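To make the process split visible, here is a minimal sketch (not the original poster's code) in which the collate_fn records the ID of the process it runs in. ToyDataset and collect_collate_pids are hypothetical names made up for this illustration, and cook_data is only a stand-in for the real preprocessing. With num_workers=0 the collate_fn runs in the main process; with num_workers > 0 it should report worker process IDs instead.

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the real dataset: random feature vectors."""

    def __init__(self, n=64, dim=8):
        self.data = torch.randn(n, dim)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def cook_data(samples):
    """collate_fn: stack the samples and record which process did the work."""
    return {"x": torch.stack(samples), "pid": os.getpid()}


def collect_collate_pids(num_workers):
    """Iterate over the loader once; return the process IDs that ran cook_data."""
    loader = DataLoader(ToyDataset(), batch_size=16,
                        collate_fn=cook_data, num_workers=num_workers)
    return {batch["pid"] for batch in loader}


if __name__ == "__main__":
    print("main process pid:      ", os.getpid())
    print("collate pids, 0 workers:", collect_collate_pids(0))
    print("collate pids, 2 workers:", collect_collate_pids(2))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started with spawn rather than fork, since the script is re-imported in each worker there.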

Thanks. We tested our data loading without training and it is very fast. But we also put some timestamps in our code and found that the training time is only part of the total time. Is that because the GPU is async and our time recording may not be that accurate?

start_time =
for epoch in range(config['epochs']):
    for fi, batch in enumerate(my_data_loader):
        train_time =
        train_endtime =
total_endtime =

CUDA operations are asynchronous, so you won’t capture their runtime and it will be accumulated in the next blocking operation.
You can profile the complete code e.g. with Nsight Systems and check the timeline to narrow down the bottleneck, if your current profiling with timers isn’t giving enough information (or use the PyTorch profiler and create the timeline output).
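If you want to keep manual timers, one common approach is to synchronize before reading the clock, so that pending GPU work is included in the measured interval. The following is only a sketch under assumptions: timed_train_step is a hypothetical helper, the squared-output loss is a placeholder for the real loss, and the model and batch are toy examples. Note that synchronizing on every step can slow training down, so this is for measurement only.

```python
import time

import torch
from torch import nn


def timed_train_step(model, batch, optimizer, device):
    """Run one optimizer step and return its wall-clock time in seconds."""
    on_gpu = device.type == "cuda"
    if on_gpu:
        torch.cuda.synchronize()  # finish earlier GPU work before starting the timer
    start = time.perf_counter()

    optimizer.zero_grad()
    loss = model(batch.to(device)).pow(2).mean()  # placeholder loss for the sketch
    loss.backward()
    optimizer.step()

    if on_gpu:
        torch.cuda.synchronize()  # wait for the queued kernels of this step
    return time.perf_counter() - start


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(8, 4).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    step_time = timed_train_step(model, torch.randn(16, 8), optimizer, device)
    print(f"step time: {step_time:.6f} s")
```

Without the synchronize calls, the timer would mostly measure how long it takes to queue the kernels, and the remaining GPU time would show up in whatever blocking operation comes next.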


Hi @ptrblck, I ran some tests on the DataLoader and the training pipeline. They look like this:

  • Experiment 1: Data preloaded into memory before training

replay_mem = {}
for fi, batch in enumerate(my_data_loader):
    replay_mem[fi] = batch

# Training with all data in memory
for i in range(epoch):
    for fi, batch in replay_mem.items():
        ...  # training step

  • Experiment 2: No preloading, data is loaded via the DataLoader during training

for i in range(epoch):
    for fi, batch in enumerate(my_data_loader):
        ...  # training step

Experiment 1 takes less time, and the difference compared with experiment 2 is roughly the data loading time (as measured when I test the data loading alone). I imagine that if we have a pipeline where training just consumes batches from the multi-worker DataLoader, and the training time per batch is higher than the loading time, then the data loading time will be completely hidden (data loading is always faster, so training always has data available)?

Yes, if the model training takes more time than loading and processing the next batch, the data loading time will be hidden and come “for free”. Note that in the default setup the next epoch would create new workers, which would have to start creating new batches. You could use persistent_workers=True to avoid this behavior.
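As a rough sketch of that setting (build_loader and the TensorDataset below are hypothetical and only for illustration), the flag is passed when constructing the DataLoader. It requires num_workers > 0, which is why the helper only enables it in that case.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, num_workers=2, batch_size=16):
    """Create a DataLoader whose workers stay alive between epochs."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        # persistent_workers is only valid together with num_workers > 0
        persistent_workers=num_workers > 0,
    )


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 8))  # toy data for the sketch
    loader = build_loader(dataset)
    for epoch in range(3):
        # The same worker processes are reused for every epoch here.
        for (x,) in loader:
            pass  # training step would go here
```

Whether this removes the gap you measured between experiments 1 and 2 depends on how much of it comes from worker startup at the beginning of each epoch, so it is worth timing both variants.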