Allreduce takes 100% gpu utilization if it is getting

amsword · March 19, 2022, 7:37am

GPU utilization can read 100% even when a process is stuck in allreduce, so how can I figure out how much of that is actual computation? A value close to 100% might just mean the GPU is waiting on allreduce or on network communication.

ptrblck · March 20, 2022, 4:30am

You could profile your code with the PyTorch profiler or, e.g., Nsight Systems to see when NCCL is waiting on communication and whether other computations are executed in the meantime.
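
As a rough illustration of the profiler route, here is a minimal sketch that runs one training step under `torch.profiler` and splits the recorded GPU time into NCCL kernels versus everything else. The helper names `split_comm_vs_compute` and `profile_step` are hypothetical, and matching on the substring "nccl" in the event name is an assumption about how the communication kernels are labeled in your build, so check the names in your own trace first. Keep in mind that time spent inside an NCCL kernel includes time spent waiting for the other ranks, which is why the utilization counter alone cannot separate the two.

```python
from typing import Dict, Iterable, Tuple


def split_comm_vs_compute(
    events: Iterable[Tuple[str, float]], comm_marker: str = "nccl"
) -> Dict[str, float]:
    """Sum GPU time (microseconds) for communication vs. other kernels.

    `events` is an iterable of (event_name, gpu_time_us) pairs.
    Assumption: communication kernels contain `comm_marker` in their name.
    """
    comm = 0.0
    compute = 0.0
    for name, gpu_time_us in events:
        if comm_marker in name.lower():
            comm += gpu_time_us
        else:
            compute += gpu_time_us
    total = comm + compute
    return {
        "comm_us": comm,
        "compute_us": compute,
        "comm_fraction": comm / total if total else 0.0,
    }


def profile_step(step_fn) -> Dict[str, float]:
    """Run `step_fn` once under the PyTorch profiler and split its GPU time.

    `step_fn` is a hypothetical zero-argument callable that performs one
    forward/backward/optimizer step on an already initialized model.
    """
    import torch  # imported lazily so the helper above works without PyTorch
    from torch.profiler import ProfilerActivity, profile

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        step_fn()
        torch.cuda.synchronize()  # make sure all queued kernels have finished

    events = []
    for evt in prof.key_averages():
        # The attribute name differs across PyTorch versions; try the newer one first.
        t = getattr(evt, "self_device_time_total", None)
        if t is None:
            t = getattr(evt, "self_cuda_time_total", 0)
        events.append((evt.key, float(t)))
    return split_comm_vs_compute(events)
```

Calling `profile_step(my_step)` on each rank and comparing `comm_fraction` across ranks should give a first hint of whether the GPU is mostly computing or mostly sitting in allreduce. For a timeline view, `prof.export_chrome_trace(...)` or Nsight Systems will show where the NCCL kernels overlap with compute.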