Bottleneck scaling issues with multi-GPU training


I am having a problem where, as I scale to multiple GPUs, my training is bottlenecked by the capacity of a single GPU. For example, I am using 8 GPUs with 16GB of memory each. GPU 0 uses approximately its full capacity, but the other GPUs only use 11GB. I know that one GPU is designated as the main GPU and requires extra memory to coordinate the others, but is there any way to make this more efficient? A printout of my nvidia-smi usage can be found below. Thank you.
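As a complement to nvidia-smi, you can query PyTorch's own caching allocator to see what your process actually holds on each device. This is a minimal sketch (the helper name `gpu_memory_summary` is just for illustration):

```python
import torch

def gpu_memory_summary():
    """Report (allocated, reserved) memory per visible GPU, in MB.

    These numbers come from PyTorch's caching allocator, so they show
    what this process holds -- nvidia-smi also counts CUDA context
    overhead and other processes.
    """
    stats = {}
    for i in range(torch.cuda.device_count()):
        stats[i] = (
            torch.cuda.memory_allocated(i) / 1024**2,
            torch.cuda.memory_reserved(i) / 1024**2,
        )
    return stats

for dev, (alloc, reserved) in gpu_memory_summary().items():
    print(f"cuda:{dev}: {alloc:.0f} MB allocated, {reserved:.0f} MB reserved")
```

Comparing these per-device numbers before and after the forward pass makes the imbalance on GPU 0 easy to pin down.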

You could follow @Thomas_Wolf’s blog post on how to use the memory more efficiently. :slight_smile:

Awesome, thank you. Are there any plans to add this functionality into PyTorch itself? I am curious whether I should continue balancing loads like this in the future or if it will become unnecessary. Thanks again.

Although memory-wise it may seem like a bottleneck, I don’t think distributing the load further across the GPUs will improve speed; it would only help avoid memory bottlenecks. If memory is the concern, rather than distributing it evenly, it would probably be better to use a separate GPU for the loss computation and gradient accumulation (or even do that step on the CPU, since copying data across GPUs is expensive). We actually had a discussion about that recently here :slight_smile: Uneven GPU utilization during training backpropagation
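One common way to reduce the imbalance with `nn.DataParallel` is to move the loss computation inside the model's `forward()`, so each replica computes its loss on its own GPU and only the per-replica scalar losses are gathered on the main device, instead of the full output tensors. A minimal sketch (the wrapper class and the toy `nn.Linear` model are just for illustration):

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Wrap a model so the loss is computed inside forward().

    Under nn.DataParallel each replica then computes its own loss on
    its own GPU, and only tiny per-replica loss tensors are gathered
    on the main device instead of the full outputs.
    """
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # Return a 1-element tensor so DataParallel can concatenate the
        # per-replica losses along dim 0.
        return self.criterion(outputs, targets).unsqueeze(0)

# Toy model and criterion for illustration.
model = nn.Linear(128, 10)
wrapped = ModelWithLoss(model, nn.CrossEntropyLoss())
if torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped)

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))
loss = wrapped(x, y).mean()  # average over replicas (a no-op on one device)
loss.backward()
```

Note that averaging the per-replica losses this way is only exact when the batch splits evenly across GPUs; for uneven splits you would weight by the per-replica batch size.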