DataParallel vs increasing # workers in data loader

I’m trying to understand what the differences are in using DataParallel vs increasing the num_workers in the DataLoader.

It seems that DataParallel splits each batch evenly across the available GPUs, so that the forward and backward passes run on each chunk in parallel.
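To make the splitting concrete, here is a rough sketch in plain Python of how one batch might be divided among devices. It assumes the split behaves like `torch.chunk` along the batch dimension (equal chunks, with a smaller last chunk when the batch size is not divisible). The helper name `split_sizes` is made up for illustration and is not part of PyTorch.

```python
import math


def split_sizes(batch_size, num_devices):
    """Per-device chunk sizes for one batch (hypothetical helper, not a PyTorch API).

    Assumes torch.chunk-like behaviour: every chunk has ceil(batch_size / num_devices)
    samples except possibly the last one, which takes whatever is left over.
    """
    chunk = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(split_sizes(64, 4))  # [16, 16, 16, 16]
print(split_sizes(10, 4))  # [3, 3, 3, 1]
```

Under this assumption the split is only perfectly uniform when the batch size is a multiple of the number of GPUs, which is one reason to pick batch sizes that way.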

But what does increasing num_workers in the DataLoader do? That is, does each spawned process produce a new batch?

If num_workers > 0, that many separate worker processes are spawned to do the data loading. Each worker loads one whole batch at a time. This keeps data loading from becoming a bottleneck, because several processes prepare batches in the background. With num_workers=0 the data is loaded in the main process, so after each forward pass the GPU has to wait for the next batch to be loaded.
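One way to check this yourself is to tag every sample with the ID of the worker that loaded it, using `torch.utils.data.get_worker_info()`. The sketch below is only an illustration: the dataset class `WorkerTaggedDataset` and the helper `workers_per_batch` are made-up names, and the dataset is a toy that returns indices instead of real data.

```python
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info


class WorkerTaggedDataset(Dataset):
    """Toy dataset (hypothetical): each item records which worker loaded it."""

    def __init__(self, size=16):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        info = get_worker_info()  # None when loading happens in the main process
        worker_id = -1 if info is None else info.id
        return torch.tensor(index), torch.tensor(worker_id)


def workers_per_batch(num_workers, batch_size=4):
    """Return, for every batch, the sorted list of worker IDs that contributed to it."""
    loader = DataLoader(
        WorkerTaggedDataset(), batch_size=batch_size, num_workers=num_workers
    )
    return [sorted(set(worker_ids.tolist())) for _, worker_ids in loader]


if __name__ == "__main__":
    # num_workers=0: everything is loaded in the main process (tagged -1 here).
    print(workers_per_batch(num_workers=0))
    # num_workers=2: each batch should list exactly one worker ID.
    print(workers_per_batch(num_workers=2))
```

With `num_workers=2`, every inner list should contain a single worker ID, which shows that a batch is assembled entirely by one worker rather than pieced together from several. The `if __name__ == "__main__":` guard is there because worker processes may re-import the script on platforms that do not fork.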