DataLoader iterator initialization time differs drastically between single-GPU and distributed multi-GPU configurations

Hi,

Problem

I observe significantly different behavior when creating an iterator from a multi-process DataLoader in single-GPU training versus multi-GPU distributed training.

Background

I’m using a custom map-style dataset, in which I read a large text file, convert it to a list of numpy arrays, and keep them in memory for later indexing. I also define a custom batch sampler, which pre-generates a list of batch indices, keeps them in memory, and yields a list of sample indices each time.
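
For context, here is a minimal sketch of what the two classes look like (the class names, the tab-separated file layout, and the token-id parsing are illustrative assumptions, not the exact code):

    import numpy as np
    import torch
    from torch.utils.data import Dataset, Sampler

    class MemoryDataset(Dataset):
        """Reads the whole text file once and keeps every example as numpy arrays in RAM."""
        def __init__(self, path):
            self.samples = []
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # e.g. a tab-separated (source, target) pair of token-id sequences
                    src, tgt = line.rstrip("\n").split("\t")
                    self.samples.append((np.array(src.split(), dtype=np.int64),
                                         np.array(tgt.split(), dtype=np.int64)))

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            return self.samples[idx]

    class PregeneratedBatchSampler(Sampler):
        """Pre-generates all batch index lists and yields one list of sample indices per batch."""
        def __init__(self, batches):
            self.batches = batches  # list of lists of sample indices, built once and kept in memory

        def __iter__(self):
            return iter(self.batches)

        def __len__(self):
            return len(self.batches)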

Here is how I use the DataLoader:

    itr = torch.utils.data.DataLoader(memory_dataset, batch_sampler=batch_sampler,
                                      collate_fn=Collate(source_vocab.pad_id, target_vocab.pad_id),
                                      num_workers=args.num_workers)

    for epoch in range(max_epoch):
        # Iterating over the DataLoader (re)creates the worker processes at the start of each epoch.
        for batch in itr:
            batch = to_cuda(batch)
            ...

Collate() is a callable object that gathers a list of numpy arrays from memory_dataset and converts them into CPU tensors.
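
For reference, a minimal sketch of such a collate callable (the pair structure and padding via pad_sequence are assumptions on my side; the real Collate may differ):

    import torch
    from torch.nn.utils.rnn import pad_sequence

    class Collate:
        """Pads a list of variable-length numpy array pairs and stacks them into CPU tensors."""
        def __init__(self, source_pad_id, target_pad_id):
            self.source_pad_id = source_pad_id
            self.target_pad_id = target_pad_id

        def __call__(self, samples):
            # samples: list of (source, target) numpy arrays returned by the dataset
            sources = [torch.from_numpy(src) for src, _ in samples]
            targets = [torch.from_numpy(tgt) for _, tgt in samples]
            return (pad_sequence(sources, batch_first=True, padding_value=self.source_pad_id),
                    pad_sequence(targets, batch_first=True, padding_value=self.target_pad_id))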

Profiling

I run the above code in 1-GPU and 2-GPU settings and vary the number of workers in the DataLoader. For the 2-GPU setting, I use DistributedDataParallel.

Here I measure the average time, in seconds, it takes to initialize itr as an iterator object at the start of each epoch in the different settings:

(1) 1 GPU, 0 workers: 0.7s
(2) 1 GPU, 1 worker: 0.8s
(3) 1 GPU, 6 workers: 1.1s
(4) 2 GPUs, 0 workers: 0.8s
(5) 2 GPUs, 1 worker: 37s
(6) 2 GPUs, 6 workers: 210s
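
A simplified sketch of the measurement (the exact timing code differs, but iter(itr) is what is being timed, since that is where the worker processes are started):

    import time

    for epoch in range(max_epoch):
        start = time.perf_counter()
        batch_iter = iter(itr)   # worker processes are created here when num_workers > 0
        print(f"epoch {epoch}: iterator init took {time.perf_counter() - start:.1f}s")

        for batch in batch_iter:
            batch = to_cuda(batch)
            ...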

Questions

  1. PyTorch’s torch.utils.data documentation says that for an iterable-style dataset, “When num_workers > 0, each worker process will have a different copy of the dataset object”. I assume the same holds for my map-style dataset, which would explain results (4)(5)(6), since more workers require more copy operations from the main process (a small diagnostic sketch to check when each worker becomes ready follows after these questions).
    But this does not seem to hold in the 1-GPU settings (1)(2)(3). Why?

  2. PyTorch’s torch.multiprocessing documentation mentions that “It registers custom reducers, that use shared memory to provide shared views on the same data in different processes.” If this implies that torch.multiprocessing automatically creates a shared view of memory across sub-processes for arbitrary types, such as numpy.ndarray and Python built-in types, then I would assume that each DataLoader worker does not hold its own copy of the dataset but accesses it through shared memory. That would explain results (1)(2)(3): since no copying is required, the time to start a different number of workers has little to do with the size of the dataset.
    But (4)(5)(6) grow almost linearly with the number of workers. Why is that?

  3. Overall, I can’t figure out why the DataLoader behaves so differently in single-GPU and distributed multi-GPU configurations.
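
One thing I can instrument (sketch below, reusing the names from my code above; worker_init_fn and torch.utils.data.get_worker_info() are the standard DataLoader hooks for this) is when each worker process actually becomes ready, which should show whether the startup cost is paid once per worker:

    import os
    import time
    import torch

    def log_worker_startup(worker_id):
        # Runs inside each worker process right after it has been started.
        info = torch.utils.data.get_worker_info()
        print(f"{time.strftime('%H:%M:%S')} pid={os.getpid()} "
              f"worker {worker_id + 1}/{info.num_workers} ready, "
              f"sees {len(info.dataset)} samples")

    itr = torch.utils.data.DataLoader(
        memory_dataset,
        batch_sampler=batch_sampler,
        collate_fn=Collate(source_vocab.pad_id, target_vocab.pad_id),
        num_workers=args.num_workers,
        worker_init_fn=log_worker_startup,
    )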

Any suggestions would be appreciated!

Is memory_dataset loading the data in its __init__ or does it already contain all the data?
Each worker would create a copy of this dataset, which would explain the slight increase in (1), (2), (3).

However, each process in the DDP setup would create the specified number of workers, so you would end up with more workers overall. I don’t know why the creation takes so much more time, though.
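
To illustrate what I mean by more workers: in a typical DDP launch, every rank builds its own DataLoader, so the total number of DataLoader worker processes is world_size * num_workers. A rough sketch (the mp.spawn launcher, file path, batch size, and pad ids are placeholders reusing the hypothetical classes sketched above, not your actual setup):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def train(rank, world_size, num_workers):
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

        # Every rank loads the data and builds its own DataLoader, so with
        # world_size = 2 and num_workers = 6 there are 2 * 6 = 12 worker
        # processes in total, each needing its own view of the dataset.
        dataset = MemoryDataset("train.txt")
        batches = [list(range(i, min(i + 64, len(dataset))))
                   for i in range(0, len(dataset), 64)]
        itr = torch.utils.data.DataLoader(
            dataset,
            batch_sampler=PregeneratedBatchSampler(batches),
            collate_fn=Collate(source_pad_id=0, target_pad_id=0),
            num_workers=num_workers,
        )
        for batch in itr:
            ...  # forward/backward on GPU `rank`

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2
        mp.spawn(train, args=(world_size, 6), nprocs=world_size)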

Thanks for your attention. memory_dataset loads all the data in its __init__ function. I also expected the same performance in the single-GPU case and in the DDP multi-GPU case.