Allreduce takes 100% GPU utilization if it is getting

Since GPU utilization can also read 100% while a process is stuck in allreduce, how can I figure out how much of it is actual computation? A reading close to 100% could also mean the GPU is just waiting for allreduce or for network communication.

You could profile your code with the PyTorch profiler or, e.g., Nsight Systems to see when NCCL is waiting for communication and whether other computations are executed in the meantime.
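As a rough sketch of the profiler route, something like the following could split the recorded GPU time into NCCL kernels and everything else. The names `split_comm_compute`, `profile_step`, and `step_fn` are hypothetical helpers made up for this post, and matching on the substring "nccl" is an assumption about how the kernels are named in your build. NCCL kernels keep running while they wait for peers, so their time here includes the waiting.

```python
from typing import Callable, Iterable, Tuple


def split_comm_compute(events: Iterable[Tuple[str, float]]) -> dict:
    """Split device time (microseconds) into NCCL and non-NCCL buckets.

    Assumption: communication kernels contain "nccl" in their name.
    """
    comm = 0.0
    other = 0.0
    for name, device_time_us in events:
        if "nccl" in name.lower():
            comm += device_time_us
        else:
            other += device_time_us
    total = comm + other
    return {
        "comm_us": comm,
        "other_us": other,
        "comm_fraction": comm / total if total > 0 else 0.0,
    }


def profile_step(step_fn: Callable[[], None], steps: int = 5) -> dict:
    """Profile a few calls of step_fn (a hypothetical training-step callable)."""
    # Imported lazily so split_comm_compute works without PyTorch installed.
    import torch
    from torch.autograd import DeviceType
    from torch.profiler import ProfilerActivity, profile

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        for _ in range(steps):
            step_fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make sure queued kernels are recorded

    rows = []
    for evt in prof.key_averages():
        if evt.device_type == DeviceType.CPU:
            continue  # keep only device-side rows (kernels, memcpys)
        # The attribute name differs between PyTorch versions, so try both.
        t = getattr(evt, "self_device_time_total", None)
        if t is None:
            t = getattr(evt, "self_cuda_time_total", 0.0)
        rows.append((evt.key, t))
    return split_comm_compute(rows)
```

You would call `profile_step(step_fn)` on each rank, with `step_fn` running one forward/backward/optimizer step. A high `comm_fraction` suggests the "100% utilization" is mostly NCCL time rather than compute. For the exact overlap between communication and compute, the timeline view in Nsight Systems or an exported profiler trace is more reliable than these totals.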
