I am trying to train a large model. The training itself goes fine, but once evaluation starts, the per-iteration/step time increases over time and keeps getting worse. It eventually stalls at a point where GPU usage drops to zero while some GPU memory is still allocated. There are no I/O operations happening during this time, and the other hardware resources are used minimally. The issue only seems to appear when I increase the dataset size: with a smaller dataset the evaluation completes and the training finishes, although the pattern of increasing per-iteration/step time is still noticeable. Increasing the batch size does help in that case, but the evaluation progress bar still stops moving over time. Is there any way to debug this?
Did you check if your host RAM usage is increasing and if it's eventually using the swap?
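Something along these lines could be used to log host RAM and swap while the evaluation runs (a minimal sketch assuming psutil is installed; the polling interval and output destination are arbitrary):

```python
# Minimal sketch: periodically log host RAM and swap usage while the
# evaluation loop runs, to see whether either keeps growing.
import time
import psutil

def log_host_memory(interval_s: float = 30.0) -> None:
    while True:
        vm = psutil.virtual_memory()
        sw = psutil.swap_memory()
        print(f"RAM used: {vm.used / 1e9:.1f} GB / {vm.total / 1e9:.1f} GB | "
              f"swap used: {sw.used / 1e9:.1f} GB")
        time.sleep(interval_s)

if __name__ == "__main__":
    # Run this in a separate terminal (or as a daemon thread) alongside
    # the evaluation and watch for a steady upward trend.
    log_host_memory()
```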
Thanks a lot for the quick response. I ran into some issues with my system; I will check again soon and let you know. What I noticed before is that memory usage was not very high, and there was a good amount of free RAM available (around 800 GB).
Thank you for the suggestion. I've checked the host RAM and swap usage during the evaluation phase using detailed resource monitoring logs. Both RAM and swap usage remain relatively stable throughout the training and evaluation periods, with no significant spikes or excessive consumption that would suggest memory constraints or heavy swapping activity.
So it seems you are still seeing a slowdown after a while, which also appears to be related to the dataset size, but neither I/O operations nor swap usage are visible. This is indeed really weird, and you might need to use a visual profiler to check which part of the code is causing the slowdown. You could also check the thermal readings to see if the clock frequencies are being reduced due to overheating.
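For example, a few evaluation steps could be captured with torch.profiler and inspected visually (a rough sketch; model and eval_loader stand in for your own objects):

```python
# Rough sketch: profile a handful of evaluation steps and export a trace that
# can be opened in chrome://tracing or TensorBoard. `model` and `eval_loader`
# are placeholders for your own objects.
import torch
from torch.profiler import profile, ProfilerActivity

model.eval()
with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for step, (data, target) in enumerate(eval_loader):
            output = model(data.cuda(non_blocking=True))
            if step == 10:  # a few steps are enough to see the pattern
                break

prof.export_chrome_trace("eval_trace.json")
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
```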
Thank you for your suggestions. I've used a profiler to watch the operations during the evaluation process. The profiler output shows that certain operations, particularly aten::copy_ and aten::cudnn_convolution, take up a large share of the time and might be contributing to the slowdown.
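To see which part of the eval step these calls are attributed to, the regions can be tagged with record_function inside the same profiler context (a sketch with placeholder names; heavy aten::copy_ time during evaluation often comes from host-to-device transfers or per-batch .item()/.cpu() syncs, but that is an assumption here):

```python
# Sketch: tag regions of the eval step so the profiler table/trace attributes
# aten::copy_ to a specific phase. Run inside the same profile(...) context as
# above; region, loss, and variable names are placeholders.
from torch.profiler import record_function

with torch.no_grad():
    for data, target in eval_loader:
        with record_function("h2d_transfer"):
            data = data.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
        with record_function("forward"):
            output = model(data)
        with record_function("metric_sync"):
            # .item() triggers a device-to-host copy and a synchronization per batch
            total_loss += criterion(output, target).item()
```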
Thanks for checking the profile! In this case I don't understand how the size of the dataset could be related and would guess your GPU might be running into thermal issues. Monitor the clocks via nvidia-smi dmon to see if the frequencies drop.
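Something like the following would record SM clocks and temperature while the evaluation runs (a sketch polling nvidia-smi's query interface; the field list and interval are just one possible choice):

```python
# Sketch: poll SM/memory clocks, temperature, and utilization during
# evaluation to see whether the GPU is throttling. Roughly equivalent to
# watching `nvidia-smi dmon` in a separate terminal.
import subprocess
import time

while True:
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=clocks.sm,clocks.mem,temperature.gpu,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(result.stdout.strip())
    time.sleep(5)
```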
I was able to fix this issue by decreasing the num_workers parameter to zero. Thanks for the help @ptrblck. Though I still don't understand the exact reason for the error, the evaluation sped up once I set the number of workers to zero.
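For reference, the change amounts to building the evaluation DataLoader like this (a sketch; the dataset, batch size, and other arguments are placeholders, and only num_workers=0 is the actual change):

```python
# The change that resolved the slowdown/hang in this case: load evaluation
# batches in the main process instead of worker processes.
from torch.utils.data import DataLoader

eval_loader = DataLoader(
    eval_dataset,      # placeholder for your evaluation dataset
    batch_size=64,     # placeholder
    shuffle=False,
    num_workers=0,     # was > 0; workers appeared to stall with the larger dataset
    pin_memory=True,   # placeholder
)
```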