Bottleneck scaling issues with multi-GPU training

Although it may seem like a bottleneck memory-wise, I don’t think distributing the load further across the GPUs will improve speed – it would help avoid memory bottlenecks, though. If memory is the concern, rather than distributing the work evenly, it would probably be better to use a separate GPU for loss computation and gradient accumulation (or even do that step on the CPU, since copying data across GPUs is expensive). We actually had a discussion about that recently here :slight_smile: Uneven GPU utilization during training backpropagation
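To make that concrete, here’s a minimal PyTorch sketch of what I mean by keeping the loss on a separate device. The devices, the toy model, and the sizes are just placeholder assumptions, not your actual setup:

```python
import torch
import torch.nn as nn

# Assumption: at least two GPUs; swap loss_device for torch.device("cpu")
# if you'd rather do the loss/accumulation step on the CPU.
model_device = torch.device("cuda:0")
loss_device = torch.device("cuda:1")

# Toy model just for illustration
model = nn.Linear(512, 10).to(model_device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

inputs = torch.randn(32, 512, device=model_device)
targets = torch.randint(0, 10, (32,), device=loss_device)

optimizer.zero_grad()
logits = model(inputs)                              # forward pass on cuda:0
loss = criterion(logits.to(loss_device), targets)   # loss computed on the other device
loss.backward()                                     # autograd copies grads back across devices
optimizer.step()
```

The cross-device copy of the logits (and of the gradients on the way back) is exactly the transfer cost I mentioned, so whether this actually helps depends on how big your outputs are relative to the activations you’d otherwise have to shard.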