Does the PyTorch multi-worker DataLoader run in parallel with the training code?

Hi All

We have a DataLoader and training code that works like this:

for fi, batch in enumerate(my_data_loader):
    train()

and in our DataLoader we have defined a collate_fn, cook_data:

DataLoader(my_dataset,
           num_workers=config['num_dataloader_worker'],
           batch_size=config['dataloader_batch_size'],
           timeout=600,
           collate_fn=cook_data)

My question is: while training is running, can the DataLoader run in the background in parallel to do things like cook_data, or will each "process" first load/cook the data and then run training, so that during training this particular process is basically blocked, waiting there?

The DataLoader will use multiprocessing to create multiple workers, which will load and process each data sample and add the batch to a queue. It should thus not be blocking the training as long as the queue is filled with batches.
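For illustration, here is a minimal sketch of that setup (the dataset, the cook_data body, and the numbers are placeholders, not the original code): the worker processes run cook_data and keep a prefetch queue filled while the main process iterates and trains.

import torch
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    # placeholder dataset returning raw samples for cook_data to combine
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def cook_data(samples):
    # placeholder collate_fn; it runs inside each worker process
    return torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in samples])

loader = DataLoader(MyDataset(list(range(1000))),
                    batch_size=32,
                    num_workers=4,       # 4 worker processes fill the queue
                    prefetch_factor=2,   # each worker keeps 2 batches ready ahead of time
                    collate_fn=cook_data)

for fi, batch in enumerate(loader):
    # while this iteration trains, the workers already prepare the next batches
    pass  # train(batch) would go here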

Thanks @ptrblck, so the training will run in a different process than the multiprocessing DataLoader workers, right? From your description, if training is slower than data loading, then basically we should get continuous training and the loading time will be hidden?

Also, if I use DataParallel, which as I understand uses multithreading, how will this multithreaded DataParallel work with the multi-process DataLoader? Is it still the same way: the multi-process DataLoader loads the data into a queue, and the training process (a different process) spins up multiple threads, one per GPU, to train?

Yes, the main process executes the training loop, while each worker is spawned in a new process via multiprocessing. nn.DataParallel and the DataLoader do not interfere with each other.
Also yes, if the loading pipeline is faster than the training, the data loading time would be "hidden".
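As a rough sketch of how the two combine (the model, dataset, and hyperparameters below are made up, and it assumes at least one CUDA device is available): the DataLoader workers are separate processes feeding batches, while nn.DataParallel splits each batch across the GPUs using threads inside the main process.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# made-up model and dataset, just to show how the pieces combine
model = nn.Linear(10, 2)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # multi-threaded scatter/gather across GPUs
model = model.cuda()

dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4)  # multi-process loading

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for fi, (data, target) in enumerate(loader):
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    loss = criterion(model(data), target)  # DataParallel splits the batch over the GPUs
    loss.backward()
    optimizer.step()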

Thanks, we tested our data loading without training and it is very fast. But we also put some timestamps in our code and found that the training time is only part of the total time. Is that because the GPU is async and our time recording may not be accurate?

from datetime import datetime

start_time = datetime.now()
for loop in range(0, config['epochs']):
    for fi, batch in enumerate(my_data_loader):
        train_time = datetime.now()
        train()
        train_endtime = datetime.now()
total_endtime = datetime.now()

CUDA operations are asynchronous, so you won’t capture their runtime and it will be accumulated in the next blocking operation.
You can profile the complete code e.g. with Nsight Systems and check the timeline to narrow down the bottleneck, if your current profiling with timers isn’t giving enough information (or use the PyTorch profiler and create the timeline output).
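If you want to stick with timers, a common pattern (sketched below, reusing the train() name from your snippet) is to call torch.cuda.synchronize() before taking each timestamp, or to use CUDA events to time the GPU work directly:

import torch
from datetime import datetime

# Option 1: synchronize so pending CUDA work finishes before reading the clock
torch.cuda.synchronize()
train_time = datetime.now()
train()  # the training step from the snippet above
torch.cuda.synchronize()
train_endtime = datetime.now()

# Option 2: CUDA events measure the GPU time directly (in milliseconds)
start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
start.record()
train()
end.record()
torch.cuda.synchronize()
print(f"train step took {start.elapsed_time(end):.2f} ms")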


Hi @ptrblck, I did some tests for the DataLoader and pipeline. The tests look like this:

  • Experiment 1: with preloaded data (all data is loaded into memory first)
replay_mem = {}
for fi, batch in enumerate(my_data_loader):
    replay_mem[fi] = batch

# Training with all data in memory
for i in range(0, epochs):
    for fi, batch in replay_mem.items():
        train(batch)
  • Experiment 2: without preloaded data (the data is loaded via the DataLoader during training)
for i in range(0, epochs):
    for fi, batch in enumerate(my_data_loader):
        train(batch)

Experiment 1 takes less time, and the time difference compared with experiment 2 is roughly the data loading time (measured by timing the data loading alone). So if we have a pipeline where training just consumes the multi-worker DataLoader, and the training time is higher than the multi-worker loading time, the data loading time will be completely hidden? (Data loading is always faster, and thus training always has data available?)

Yes, if the model training takes more time than loading and processing the next batch, the data loading time will be hidden and will come "for free". (The next epoch would create new workers, which would have to start creating new batches in the default setup; you could use persistent_workers=True to avoid this behavior.)
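For reference, a sketch of the persistent_workers usage mentioned above (reusing the names from your snippets; it requires num_workers > 0):

from torch.utils.data import DataLoader

loader = DataLoader(my_dataset,
                    batch_size=config['dataloader_batch_size'],
                    num_workers=config['num_dataloader_worker'],
                    collate_fn=cook_data,
                    persistent_workers=True)  # keep worker processes alive across epochs

for epoch in range(config['epochs']):
    for fi, batch in enumerate(loader):
        train()  # workers are reused each epoch instead of being respawned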