Distributed training is sometimes much slower

I am using DistributedDataParallel with 4 nodes (4 GPUs per node), but the training speed is sometimes much slower than expected. Sometimes the job processes 4300 images per second, which is normal; if it starts at 4300, it stays at that fast speed for the whole run. But sometimes the job runs at 1000 images per second, and then the whole job stays at that speed. The jobs run in a cluster, on different physical machines but of the same machine type.

For the problematic job, GPU utilization is always 98%~100%. PyTorch version = 1.4; CUDA = 10.1; Ubuntu 16.04 Docker image. NCCL is definitely using InfiniBand, based on the log below. The time spent on data loading is also very small (less than 1%).

Any ideas on how to debug this?

41de615398b349e78486287e94d4883b000000:1573:1573 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB eth0:10.0.0.4<0>

Do the jobs always run on exactly the same set of machines?

If not, could there be a straggler in the cluster? Or could different network layouts (same rack, different rack, etc.) play a role here?

For debugging, it would be helpful to identify which step (forward, backward, opt.step) takes longer (on all nodes) when the throughput drops to 1000 images/s. torch.cuda.Event's elapsed_time should be able to tell you that. All communication in DDP occurs in the backward pass. If all other steps are the same but the backward pass takes long on all processes, the cause is likely a network issue. If some processes suffer from slower data loading, a slower forward pass, or a slower optimizer step, it looks more like a straggler problem.
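For example, a rough sketch of that per-step timing with CUDA events (here `model`, `criterion`, `optimizer`, `inputs`, and `targets` are placeholders for whatever your training loop already uses):

```python
import torch
import torch.distributed as dist

def timed_step(model, criterion, optimizer, inputs, targets):
    # CUDA events bracketing each phase of one iteration.
    evts = [torch.cuda.Event(enable_timing=True) for _ in range(4)]

    evts[0].record()
    out = model(inputs)                  # forward
    loss = criterion(out, targets)
    evts[1].record()
    loss.backward()                      # backward (DDP allreduce happens here)
    evts[2].record()
    optimizer.step()                     # optimizer update
    optimizer.zero_grad()
    evts[3].record()

    # elapsed_time() needs the recorded events to have completed.
    torch.cuda.synchronize()
    fwd = evts[0].elapsed_time(evts[1])  # milliseconds
    bwd = evts[1].elapsed_time(evts[2])
    opt = evts[2].elapsed_time(evts[3])
    print(f"rank {dist.get_rank()}: fwd {fwd:.1f} ms  bwd {bwd:.1f} ms  opt {opt:.1f} ms")
    return loss
```

If the backward time alone blows up on every rank when throughput drops, that points at communication; if only some ranks show slow forward/data loading, that points at a straggler.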

Thanks for your reply. I checked again: they are not running on the same machines. For the problematic job, I killed it and re-ran it; it was scheduled on those 4 machines again and the speed was still 1000 images/s. Then I submitted the same job again, and it was scheduled on another 4 machines. The new job runs at 4k images/s. So the problem might be an issue with those machines or the rack, as you suggested.

One more question: if there are network issues on those machines, or straggler issues, would it still be possible for GPU utilization to stay at 98%~100%? Since the GPUs are fully utilized, I assumed there was no network issue.

Not 100% sure, but if the GPU reports the blocking wait on AllReduce as busy time, then a slow network or a straggler could lead to 100% utilization for the entire group. This can be measured by submitting a job that only runs allreduce. BTW, based on past observations, the GPU would (sometimes?) report 100% utilization even when DDP hangs. So I think this is possible.
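For instance, a minimal allreduce-only job could look like the sketch below (assuming the usual env:// launch, i.e. RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK are set in the environment, e.g. by torch.distributed.launch with --use_env):

```python
import os
import time
import torch
import torch.distributed as dist

def main():
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT come from the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # Payload roughly the size of a large gradient bucket (~100 MB of fp32).
    tensor = torch.zeros(25 * 1024 * 1024, device="cuda")

    # Warm up NCCL before timing.
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 50
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(f"rank {dist.get_rank()}: {elapsed / iters * 1000:.2f} ms per allreduce")

if __name__ == "__main__":
    main()
```

Running this on the "slow" set of machines vs. the "fast" set should show whether the interconnect itself is the bottleneck, independent of the model.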

cc @ngimel in case you know how CUDA would behave here :slight_smile:

@mrshenli is correct: if there are straggler GPUs, the other GPUs waiting for them will report 100% utilization while in AllReduce.


Thanks very much @mrshenli @ngimel for the explanation of AllReduce leading to 100% utilization. One more question: is there any way to detect which GPU is the straggler (among the 16 GPUs)?