The GPU utilization remains unchanged during slower training

xgbj · April 7, 2024, 6:24am

Hello everyone, we recently encountered an issue where our distributed storage system became very slow during a training process. As a result, the training speed decreased to one-fifth of its usual speed. From the monitoring data, it is evident that the main cause of this slowdown is the increased time taken for data loading, as shown in the following graph:

Because Sync-BN synchronizes across multiple GPUs, the forward pass of our training process has also become slower, which is expected. However, we are curious why the GPU utilization has not changed significantly during this period.

Does anyone know why this is the case? (We speculate that one possible reason is that NCCL synchronization may lead to excessively high GPU utilization, but we couldn’t find any relevant information online.)
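For reference, here is a minimal sketch of how the data-loading wait can be separated from the compute time of each step. The function `profile_steps` and the stand-in loader and step function are hypothetical names used only for illustration; they are not our actual training code. In a real GPU loop a `torch.cuda.synchronize()` call before the last timestamp would be needed, because CUDA kernels run asynchronously.

```python
import time


def profile_steps(loader, step_fn):
    """Return a list of (data_wait_s, compute_s) pairs, one per batch."""
    timings = []
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the loader waits on storage
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward / backward / optimizer step
        # On a GPU, synchronize here (e.g. torch.cuda.synchronize()) so that
        # asynchronously launched kernels are counted in the compute time.
        t2 = time.perf_counter()
        timings.append((t1 - t0, t2 - t1))
    return timings


def slow_loader(num_batches, delay_s):
    """Hypothetical stand-in for a loader backed by slow storage."""
    for i in range(num_batches):
        time.sleep(delay_s)
        yield i


# Made-up numbers: 50 ms of storage wait versus 1 ms of compute per step.
timings = profile_steps(slow_loader(3, 0.05), lambda batch: time.sleep(0.001))
total_wait = sum(w for w, _ in timings)
total_compute = sum(c for _, c in timings)
print(f"data wait share: {total_wait / (total_wait + total_compute):.0%}")
```

A data-wait share close to 100% would point to storage as the bottleneck, which matches what our monitoring suggests.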

ptrblck · April 7, 2024, 1:57pm

NCCL communication uses GPU hardware resources and therefore shows up as high GPU utilization, even if the current process is only waiting.
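A toy model may make this concrete. It assumes the metric in question is the one reported by `nvidia-smi`, which NVIDIA documents as the percent of time over the sample period during which one or more kernels was executing. It also assumes that a collective kernel stays resident on the GPU while it waits for slower ranks. The function `gpu_utilization` and all interval values below are made up for illustration and are not measurements.

```python
def gpu_utilization(kernel_intervals, window_start, window_end):
    """Percent of the window during which at least one kernel was running."""
    # Clip every interval to the sampling window and sort by start time.
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in kernel_intervals
        if e > window_start and s < window_end
    )
    busy = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Gap before this interval: close the previous merged interval.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or adjacent: extend the current merged interval.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start
    return 100.0 * busy / (window_end - window_start)


# Hypothetical 1000 ms sampling window, times in milliseconds.
# Fast storage: 800 ms of compute kernels, then a short 100 ms collective.
fast = [(0, 800), (800, 900)]
# Slow storage: 150 ms of compute, then a collective kernel that sits on the
# GPU for 750 ms while it waits for a rank that is still loading data.
slow = [(0, 150), (150, 900)]

print(gpu_utilization(fast, 0, 1000))  # 90.0
print(gpu_utilization(slow, 0, 1000))  # 90.0
```

Under these assumptions both windows report the same utilization, although the second one does far less useful work. The metric counts time with a kernel resident, not how much of the GPU that kernel keeps busy.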

xgbj · April 7, 2024, 2:39pm

Thank you for your patient reply. So it seems that the higher GPU utilization caused by NCCL is normal. By the way, if I want to learn more about how NCCL uses hardware resources, as you mentioned in your reply, what links or documents can I refer to?