Finding the bottleneck of suddenly slowed training

I am training in an environment where CPU and memory are shared with other users, but some of the machine's GPUs are allocated to me exclusively. The training data is stored on another machine, and I access it through the NIS system.

Yesterday I ran a training job and the speed was as I expected. Today I restarted a training run of the same model, but it suddenly became slow: each step now costs more than 100 times what it did before. Another person is running a training job on the same machine, so I suspect resource contention, but I do not know how to find the bottleneck. It is not the CPU or the GPUs, since neither is heavily utilized. Maybe it is the local machine's memory, the network I/O, or the disk speed on the NIS system. Is there any suggestion for finding the bottleneck without disturbing the other person's training?

You could profile the workload with, e.g., Nsight Systems (`nsys`) to narrow down the bottleneck during the concurrent runs: if the timeline shows the GPU sitting idle while the host waits on data loading, the input pipeline (network or remote disk) is the likely culprit rather than compute.
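A lighter-weight first check, before a full profiler run, is to time the data-loading half of each step separately from the compute half inside your own training loop. This is a minimal sketch with hypothetical stand-ins (`load_batch`, `train_step`) for your real loader and step function; it is read-only with respect to the shared resources and will not disturb the other job.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - t0

# Hypothetical stand-ins for your real data loader and train step.
def load_batch():
    return list(range(1000))

def train_step(batch):
    return sum(batch)

load_times, step_times = [], []
for _ in range(10):
    batch, t_load = timed(load_batch)
    _, t_step = timed(train_step, batch)
    load_times.append(t_load)
    step_times.append(t_step)

print(f"mean load: {sum(load_times) / len(load_times):.6f}s")
print(f"mean step: {sum(step_times) / len(step_times):.6f}s")
# If mean load dominates mean step, the bottleneck is the input
# pipeline (network / remote disk), not the GPU or local compute.
```

If the load time dominates, the slowdown is in fetching data over the network, which matches the resource-contention hypothesis; if the step time dominates, look at the host-side compute or memory pressure instead.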
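If data loading does look slow, you can measure the raw read throughput of the remote mount directly, independent of the training code. The sketch below defines a simple sequential-read benchmark; the demo runs against a local temp file (an assumption for self-containment), but pointing `path` at one of your training files on the NIS-served mount would measure the network-plus-remote-disk path without writing anything to it.

```python
import os
import tempfile
import time

def read_throughput(path, block_size=1 << 20):
    """Sequentially read `path` in block_size chunks; return MB/s."""
    total = 0
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - t0
    return (total / (1 << 20)) / elapsed if elapsed > 0 else float("inf")

# Demo against a local temp file; replace `path` with a file on the
# remote mount to benchmark the actual data path your loader uses.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(8 << 20))  # 8 MiB of random data
    path = f.name
print(f"{read_throughput(path):.1f} MB/s")
os.unlink(path)
```

Comparing the number you get today against the same measurement during an uncontended period (or against the local disk) tells you whether the shared network/storage path is the resource being fought over.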