Multi-GPU Dataloader and multi-GPU Batch?

The usual workflow would be to load the next batch into RAM while the GPUs are busy training the model, which hides the latency of the data loading step.
In each iteration you would push the current batch to the device and start loading the next one in the background.
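A minimal sketch of this overlap, assuming a CUDA device and a dummy dataset standing in for your real data (the DataLoader workers prefetch batches on the CPU while the GPU computes; `pin_memory=True` plus `non_blocking=True` lets the host-to-device copy overlap with compute):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset; replace with your real dataset
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# workers load and preprocess the next batches in the background
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

device = torch.device('cuda:0')
for data, target in loader:
    # non_blocking=True allows the copy to overlap with GPU compute,
    # since the batch comes from pinned (page-locked) host memory
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...
```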

Yes, that’s correct. In order to execute the desired operation (e.g. a matrix multiplication), the parameters (data and weight) need to be on the same device.
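For example (a toy sketch; the commented-out line would raise a device-mismatch error):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2).to('cuda:0')  # weights live on GPU0
x = torch.randn(4, 10)                 # input still on the CPU

# out = model(x)  # RuntimeError: tensors expected on the same device
out = model(x.to('cuda:0'))            # works: data and weights match
```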

Generally yes. However, you usually don’t need to handle the actual data scattering and gathering yourself, as DDP will do all of this for you.
You would specify a default device (e.g. GPU0), push the data onto this default device, and execute your training. DDP should then automatically scatter the provided data along dim0 to all devices, as well as take care of all synchronization.
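A minimal sketch of this "push to the default device and let the wrapper scatter" workflow, shown here with `nn.DataParallel` for brevity (the older single-process, multi-device DDP mode followed the same scattering semantics; this assumes at least two visible GPUs):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2).to('cuda:0')              # default device
model = nn.DataParallel(model, device_ids=[0, 1])

data = torch.randn(8, 10).to('cuda:0')             # full batch on GPU0
out = model(data)                                  # 4 samples run on each GPU
print(out.shape)                                   # gathered on GPU0: torch.Size([8, 2])
```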

Multiple DataLoaders might make sense for specific use cases, but you would have to profile your system resources and check whether this creates a bottleneck (loading data from the SSD, preprocessing on the CPU, etc.).
I would recommend using a single DataLoader with DDP first, as shown in the linked tutorial.
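A minimal sketch of the single-DataLoader setup inside each DDP process, assuming the process group has already been initialized as in the tutorial (the dataset and `num_epochs` are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# each rank only iterates over its own shard of the dataset
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                    num_workers=2, pin_memory=True)

num_epochs = 10  # placeholder
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffles the shards each epoch
    for data, target in loader:
        ...  # move the batch to this rank's device and train
```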

nn.DataParallel and DDP will split the batch along dim0 and scatter each chunk to the corresponding device.
This tutorial gives you a good example.
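You can see the split directly by printing the input shape inside `forward`; with a batch of 8 samples on 2 GPUs, each replica receives a chunk of 4 (a toy sketch):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def forward(self, x):
        # each replica prints its own device and chunk size
        print(f'{x.device}: input shape {x.shape}')
        return x * 2

model = nn.DataParallel(Model().to('cuda:0'), device_ids=[0, 1])
out = model(torch.randn(8, 10).to('cuda:0'))
# prints e.g.:
# cuda:0: input shape torch.Size([4, 10])
# cuda:1: input shape torch.Size([4, 10])
```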

Let me know if something is unclear. :wink:
