How to understand GPU status and training speed


I have done some experiments on multi-GPU training, and I am a bit confused about the relationship between GPU status and training speed.
My original expectation was that the slowest GPU would be the bottleneck for training speed, since local gradients need to sync in each step. But my experiment results prove I was wrong. Can anyone explain why the busiest GPU doesn’t slow down training as I expected?

I’m using all_reduce to sync gradients in each step:

for param in model.parameters():
    if param.requires_grad and param.grad is not None:
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= dist.get_world_size()

I measured GPU-Util (from nvidia-smi), and my guess was that higher GPU-Util means the GPU is busier and should therefore be slower at training the same batches. More experiment results from training the same dataset:

test 1: 4 GPUs at about 95% GPU-Util - training time is 35 sec
test 2: 2 GPUs at 0% GPU-Util, 2 GPUs at 90% GPU-Util - training time is 18 sec
test 3: 3 GPUs at 0% GPU-Util, 1 GPU at 97% GPU-Util - training time is 15 sec
test 4: 4 GPUs at about 0% GPU-Util - training time is 10 sec
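One possible measurement pitfall worth ruling out: CUDA kernels launch asynchronously, so wall-clock timings are only trustworthy if the GPU is synchronized before and after the timed region. A minimal sketch of such a timing helper (the timed helper is illustrative, not the code used for the numbers above; on a CPU-only machine the synchronize calls are skipped):

```python
# Illustrative timing helper: synchronize CUDA before reading the clock,
# so asynchronously queued kernels are fully counted in the measurement.
import time

import torch


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain any previously queued kernels
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for fn's kernels to finish
    return result, time.perf_counter() - start


if __name__ == "__main__":
    x = torch.randn(256, 256)
    out, secs = timed(lambda: x @ x)
    print(f"matmul took {secs:.6f}s")
```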

If the slowest GPU were the bottleneck, then the training times of tests 2 and 3 should be similar to that of test 1. How should I understand this result?
Please also let me know if you notice any mistakes in my experiment.


One reason might be that a CUDA GPU shows 100% utilization while running NCCL collective communications, even if it is actually blocked waiting for other peers to join and doing no real work. So the GPU utilization number cannot faithfully represent how busy a GPU is.

@mrshenli, thanks for the reply. Then do you know a better way to check a GPU’s status, for example, to compare the training speed on each GPU when using multi-GPU training?

One option might be using nvprof and then visualizing the result. It will show the time consumed by different compute and communication ops. See the following links:


We are also working on extending autograd profiler to work with DDP, but we don’t have a target date for it yet.
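In the meantime, the existing autograd profiler already gives a per-op time breakdown within a single process. A minimal CPU-only sketch (the model and shapes are arbitrary; passing use_cuda=True to profiler.profile would also record CUDA kernel times):

```python
# Minimal single-process example of the autograd profiler (CPU only here).
import torch
from torch.autograd import profiler


def profile_step():
    model = torch.nn.Linear(128, 128)
    x = torch.randn(32, 128)
    with profiler.profile() as prof:
        model(x).sum().backward()
    return prof


if __name__ == "__main__":
    prof = profile_step()
    # Print the top operators by total CPU time.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```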
