Fastest way to train n models on n interconnected GPUs with batch sizes of 1?

I’m trying to train 8 models, all with a batch size of 1, on a DGX-A100 node with 8 GPUs on it. What is the fastest way to execute this?

I assume all 8 models are trained independently, so I would probably just launch 8 training scripts, masking one device per script via CUDA_VISIBLE_DEVICES.
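A minimal launcher sketch for that approach (the script name `train.py` and its `--model-id` flag are placeholders for your actual training entry point):

```shell
#!/usr/bin/env bash
# Launch one independent training job per GPU, each seeing only its own device.
for i in $(seq 0 7); do
    CUDA_VISIBLE_DEVICES=$i python train.py --model-id "$i" > "train_$i.log" 2>&1 &
done
wait  # block until all 8 background jobs have finished
```

Inside each process the visible GPU is then addressed as `cuda:0`, since device masking renumbers it.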

But is there a way to do this without making each training script run n times slower?

This approach shouldn’t slow down your training by a factor of N by design, since each GPU can execute its operations independently. If you are seeing a slowdown, that points to a CPU or IO bottleneck: either the CPU isn’t able to launch the work for all N devices fast enough, or it isn’t able to feed the data fast enough.
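If profiling points to a CPU bottleneck, one common mitigation is to pin each job to its own set of CPU cores so the processes don’t compete with each other. A sketch assuming a DGX-A100 with 128 CPU cores split evenly, 16 per job (the core count and `train.py` are assumptions, not part of the original post):

```shell
#!/usr/bin/env bash
# Give each of the 8 jobs a dedicated 16-core slice via taskset,
# alongside its masked GPU. Assumes 128 CPU cores total.
for i in $(seq 0 7); do
    start=$((i * 16))          # first core of this job's slice
    end=$((start + 15))        # last core of this job's slice
    taskset -c "$start-$end" \
        env CUDA_VISIBLE_DEVICES=$i python train.py --model-id "$i" &
done
wait
```

On NUMA systems, `numactl` can additionally bind each job’s memory to the socket its cores live on, which helps the data-feeding side of the bottleneck.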