Multi-GPU Dataloader and multi-GPU Batch?

I’ve followed the nn.DataParallel tutorial and kept my pre-processed dataset in CPU RAM; during training the batches are uploaded to a generic "cuda" device, letting the DataParallel mechanism assign the tensors to GPUs.
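
For reference, the pattern I’m describing is roughly this (a minimal sketch with placeholder data and a toy model, not my actual pre-processing or network):

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Placeholder pre-processed dataset held entirely in CPU RAM as tensors
images = torch.randn(50_000, 3, 64, 64)
labels = torch.randint(0, 10, (50_000,))
loader = DataLoader(TensorDataset(images, labels),
                    batch_size=128, shuffle=True,
                    pin_memory=True, num_workers=4)

# Toy model standing in for my real network
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))
model = nn.DataParallel(model).to("cuda")   # generic cuda device
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

for x, y in loader:
    # DataParallel scatters each batch across the visible GPUs
    x = x.to("cuda", non_blocking=True)
    y = y.to("cuda", non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```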

I’m using an AWS p2.8xlarge with 8x K80 GPUs and what has shocked me is how slow it is. The equivalent run on my 2080Ti is orders of magnitude faster. I’ve pinned the memory, but I suppose the copy/upload is expensive. The K80s also have to use cuda().float() rather than half().

I’ve also tried a p3.8xlarge with 4x V100. Whilst it is considerably faster because I can use half precision, to my surprise the GPUs only reach about 40% utilisation, and usually average less than that.

Using a p3.xlarge with 1x V100 GPU I took the same approach: uploaded all tensors to CPU RAM (pinned) and ran the exact same code (only now there’s one GPU). The GPU averaged 86% utilisation and about 2/5 of its memory was occupied by the model and the batch.

Finally, I compared CPU-to-GPU against GPU-only on my own 2080Ti, except that I can’t fit the entire data-set in GPU memory (which is why I first started looking into multi-GPU allocated data-loaders).

My approach is simple: I do the hard-disk loading at construction, transform the images to tensors, and then upload all of them to the GPU rather than keeping them on the CPU. This requires that the model, batch and data-set all fit in GPU memory.
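
Concretely, something along these lines, assuming everything fits on the card (a sketch; the 64x64 resize and the transform are placeholders for my actual pre-processing):

```python
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class GPUResidentDataset(Dataset):
    """Loads and transforms every image once at construction and keeps the
    resulting tensors on the GPU, so there are no per-batch CPU-to-GPU copies.
    (A sketch of my approach; transform and image size are placeholders.)"""

    def __init__(self, image_paths, labels, device="cuda:0"):
        tf = transforms.Compose([transforms.Resize((64, 64)),
                                 transforms.ToTensor()])
        # One-off disk read + transform, then a single upload to the GPU
        self.images = torch.stack(
            [tf(Image.open(p).convert("RGB")) for p in image_paths]
        ).to(device)
        self.labels = torch.as_tensor(labels, device=device)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Indexing returns tensors that already live on the GPU
        return self.images[idx], self.labels[idx]
```

When iterating over this I keep num_workers=0 and pin_memory=False, since the tensors are already on the GPU and there is nothing left to copy or pin.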

The comparison isn’t entirely fair: here I’m using fewer samples per class, whereas before I was using 5,000 samples per class. On AWS I had 50,000 samples in total, whereas on the 2080Ti I could only upload 20,000 samples before running out of GPU RAM.

However, the GPU is used at 100% at all times, and GPU RAM sits at 95%-98%. Compared to the 4x V100 (or even the 1x V100 fed from CPU RAM) there is a stark difference in how fast training proceeds. There are no CPU-to-GPU copies, and this offers a huge advantage. I understand that if I were to use larger models I’d have issues, or that with larger batches I might run out of memory.

I also understand that the mechanisms involved in data synchronisation are complex (GPU-to-GPU is synchronous, compared to CPU-to-GPU which can be asynchronous).

Finally some results:
4x V100 took: 0:32:51 to run 50 epochs at 128 batch size (50,000 samples in total) from CPU-to-GPU
1x V100 took: 0:36:44 to run 50 epochs at 128 batch size (50,000 samples in total) from CPU-to-GPU
1x 2080Ti took: 0:19:44 to run 50 epochs at 128 batch size (20,000 samples in total) from GPU-only

So I am wondering if there is something I can do to orchestrate a data split across multiple GPUs. I imagine one solution would be to extend nn.DataParallel so that, when multiple GPUs are present, each batch (or group of batches) runs in parallel on its own GPU, with model copies across all GPUs. This would require some form of fusing the models (or their gradients) at the end of each batch.
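
I realise this is close to what torch.nn.parallel.DistributedDataParallel already does on the gradient side (it all-reduces gradients across the model copies after backward). A rough sketch of the pattern I’m imagining, with each process keeping its own shard of the data resident on its own GPU (placeholder data and a toy model, not my actual code):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def run(rank, world_size):
    # One process per GPU; gradients are all-reduced ("fused") after backward
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}")

    # Each rank keeps its own shard of the dataset resident on its GPU
    images = torch.randn(12_500, 3, 64, 64, device=device)  # placeholder shard
    labels = torch.randint(0, 10, (12_500,), device=device)

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).to(device)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())

    for start in range(0, len(labels), 128):
        x, y = images[start:start + 128], labels[start:start + 128]
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()           # DDP all-reduces the gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```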

Another approach would be to do GPU-to-GPU copies in order to ensure that the batch tensors and the model are on the same device, though this may become a bottleneck.
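
By GPU-to-GPU copies I just mean something like the following (a trivial sketch, assuming at least two visible devices):

```python
import torch

# Batch that currently lives on the first GPU
x = torch.randn(128, 3, 64, 64, device="cuda:0")

# Explicit peer copy so the batch ends up on the same device as the model replica
x_on_gpu1 = x.to("cuda:1", non_blocking=True)
```
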
If this is not the place to have this conversation, let me know. I am willing to contribute to PyTorch in order to work on this, as I believe it would be very beneficial for people who work with smaller data-sets (smaller being relative, as the 4x V100s will hold 64 GB of data, excluding the model size and batch data).