How to understand GPU status and training speed


I have done some experiments on multi-GPU training, and I am a bit confused about the relationship between GPU status and training speed.
My original expectation was that the slowest GPU would be the bottleneck for training speed, since local gradients need to sync in each step. But my experiment results prove I was wrong. Can anyone explain why the busiest GPU doesn’t slow down training as I expected?

I’m using all_reduce to sync gradients in each step:

for param in model.parameters():
    if param.requires_grad and param.grad is not None:
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= dist.get_world_size()

I measured GPU-Util (from nvidia-smi), and my guess was that higher GPU-Util means the GPU is busier and should therefore be slower at training the same batches. More experiment results from training the same dataset:

test 1: 4 GPUs at about 95% GPU-Util - training time is 35 sec
test 2: 2 GPUs at 0% GPU-Util, 2 GPUs at 90% GPU-Util - training time is 18 sec
test 3: 3 GPUs at 0% GPU-Util, 1 GPU at 97% GPU-Util - training time is 15 sec
test 4: 4 GPUs at about 0% GPU-Util - training time is 10 sec
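One possible measurement pitfall worth ruling out: CUDA kernels launch asynchronously, so wall-clock timings are only trustworthy if the GPU is synchronized before and after the timed region. A minimal sketch of such a timing helper (the timed helper is illustrative, not the code used for the numbers above; on a CPU-only machine the synchronize calls are skipped):

```python
# Illustrative timing helper: synchronize CUDA before reading the clock,
# so asynchronously queued kernels are fully counted in the measurement.
import time

import torch


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain any previously queued kernels
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for fn's kernels to finish
    return result, time.perf_counter() - start


if __name__ == "__main__":
    x = torch.randn(256, 256)
    out, secs = timed(lambda: x @ x)
    print(f"matmul took {secs:.6f}s")
```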

If the slowest GPU were the bottleneck, then the training times of tests 2 and 3 should be similar to that of test 1. How should I understand this result?
Please also let me know if you notice any mistakes in my experiment.


One reason might be that a CUDA GPU shows 100% utilization while running NCCL collective communications, even if it is actually blocked waiting for other peers to join and doing no real work. So the GPU utilization number cannot faithfully represent how busy a GPU is.

@mrshenli, thanks for the reply. Then do you know a better way to check a GPU’s status, for example, to compare the training speed on each GPU when using multi-GPU training?

One option might be using nvprof and then visualizing the result. It will show the time consumed by different compute and communication ops. See the following links:


We are also working on extending autograd profiler to work with DDP, but we don’t have a target date for it yet.
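In the meantime, the existing autograd profiler already gives a per-op time breakdown within a single process. A minimal CPU-only sketch (the model and shapes are arbitrary; passing use_cuda=True to profiler.profile would also record CUDA kernel times):

```python
# Minimal single-process example of the autograd profiler (CPU only here).
import torch
from torch.autograd import profiler


def profile_step():
    model = torch.nn.Linear(128, 128)
    x = torch.randn(32, 128)
    with profiler.profile() as prof:
        model(x).sum().backward()
    return prof


if __name__ == "__main__":
    prof = profile_step()
    # Print the top operators by total CPU time.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```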
