I observe significantly different behavior when creating an iterator from a multiprocessing DataLoader in single-GPU training versus multi-GPU distributed training.
I’m using a custom map-style dataset, in which I read a large text file, convert it to a list of numpy arrays, and keep them in memory for later indexing. I also define a custom batch sampler, which pre-generates a list of batch indices, keeps them in memory, and yields a list of sample indices each time.
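For reference, the dataset and batch sampler look roughly like this (a minimal sketch; the class names and the line-parsing logic are illustrative, not my exact code):

```python
import numpy as np

class MemoryDataset:
    """Map-style dataset: reads everything into memory once, then indexes."""
    def __init__(self, lines):
        # Hypothetical parsing: each text line becomes a numpy array of ints.
        self.samples = [np.array([int(t) for t in ln.split()]) for ln in lines]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

class PreGeneratedBatchSampler:
    """Batch sampler: pre-computes all batch index lists up front,
    then yields one list of sample indices per batch."""
    def __init__(self, n_samples, batch_size):
        idx = list(range(n_samples))
        self.batches = [idx[i:i + batch_size]
                        for i in range(0, n_samples, batch_size)]

    def __len__(self):
        return len(self.batches)

    def __iter__(self):
        return iter(self.batches)
```

DataLoader accepts any iterable of index lists as `batch_sampler`, so these plain Python classes are enough to reproduce the setup.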
Here is how I use the DataLoader:
```python
itr = torch.utils.data.DataLoader(
    memory_dataset,
    batch_sampler=batch_sampler,
    collate_fn=Collate(source_vocab.pad_id, target_vocab.pad_id),
    num_workers=args.num_workers,
)
for epoch in range(max_epoch):
    for batch in itr:
        batch = to_cuda(batch)
        ...
```
Collate() is a callable object that gathers a list of numpy arrays from memory_dataset and converts them into CPU tensors.
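For context, here is roughly what Collate does (a minimal sketch; the (source, target) pair structure and the padding scheme are assumptions about my setup, not a definitive implementation):

```python
import numpy as np
import torch

class Collate:
    """Pads variable-length (source, target) numpy arrays in a batch
    and stacks them into CPU tensors."""
    def __init__(self, source_pad_id, target_pad_id):
        self.source_pad_id = source_pad_id
        self.target_pad_id = target_pad_id

    def __call__(self, samples):
        # samples: list of (source, target) numpy int arrays (an assumption)
        sources, targets = zip(*samples)
        return (self._pad(sources, self.source_pad_id),
                self._pad(targets, self.target_pad_id))

    def _pad(self, arrays, pad_id):
        max_len = max(a.shape[0] for a in arrays)
        out = np.full((len(arrays), max_len), pad_id, dtype=np.int64)
        for i, a in enumerate(arrays):
            out[i, :a.shape[0]] = a
        # CPU tensor here; the move to GPU happens later in to_cuda(batch)
        return torch.from_numpy(out)
```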
I run the above code in 1-GPU and 2-GPU settings and vary the number of workers in the DataLoader. For the 2-GPU setting, I use distributed training.
Here I measure the average time in seconds it takes to initialize
itr as an iterator object for each epoch in different settings:
(1) 1 GPU, 0 workers: 0.7 s
(2) 1 GPU, 1 worker: 0.8 s
(3) 1 GPU, 6 workers: 1.1 s
(4) 2 GPUs, 0 workers: 0.8 s
(5) 2 GPUs, 1 worker: 37 s
(6) 2 GPUs, 6 workers: 210 s
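For reference, this is roughly how I time iterator initialization (a self-contained sketch with a dummy dataset standing in for mine):

```python
import time
import torch

# Dummy in-memory dataset standing in for the real one.
dataset = torch.utils.data.TensorDataset(torch.arange(1000).view(-1, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=0)

# Time only iterator creation: with num_workers > 0 this is where worker
# processes are started, so any per-worker dataset copying cost shows up here.
start = time.perf_counter()
it = iter(loader)
elapsed = time.perf_counter() - start
print(f"iterator init: {elapsed:.3f}s")
```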
In PyTorch’s torch.utils.data documentation, it says that for an iterable-style dataset, “When num_workers > 0, each worker process will have a different copy of the dataset object”. I assume the same holds for a map-style dataset in my case, which would explain results (4)(5)(6), since more workers require more copy operations in the main process.
But that doesn’t seem to hold in the 1-GPU settings. Why?
In PyTorch’s torch.multiprocessing documentation, it mentions that “It registers custom reducers, that use shared memory to provide shared views on the same data in different processes.” If this implies that torch.multiprocessing automatically creates a shared view of memory across sub-processes for arbitrary types, like numpy.ndarray and Python built-in types, then each DataLoader worker would not hold a different copy of the dataset but would access it through shared memory. That would explain results (1)(2)(3): since no copy is required, initialization time is independent of the dataset size no matter how many workers there are.
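One way I can sanity-check the no-copy behavior, independent of torch.multiprocessing’s reducers: under the fork start method (the default on Linux), a child process inherits the parent’s memory copy-on-write, so nothing needs to be pickled or copied when a worker starts. A minimal stdlib sketch:

```python
import multiprocessing as mp
import numpy as np

# A large in-memory "dataset" built in the parent process.
dataset = [np.arange(1000) for _ in range(100)]

def read_from_child(q):
    # Under the 'fork' start method the child sees the parent's `dataset`
    # through copy-on-write pages: nothing is pickled or copied at start-up,
    # so child start time does not scale with dataset size.
    q.put(int(dataset[42][7]))

ctx = mp.get_context("fork")  # assumption: running on an OS that supports fork
q = ctx.SimpleQueue()
p = ctx.Process(target=read_from_child, args=(q,))
p.start()
value = q.get()
p.join()
print(value)  # the child read the parent's array without an explicit copy
```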
But the times in (4)(5)(6) grow almost linearly with the number of workers. Why is that?
Overall, I can’t figure out why the DataLoader behaves so differently in single-GPU and distributed multi-GPU configurations.
Any suggestions would be appreciated!