Hi, we are training and validating on 1M images with yolov5m. During training, system RAM usage keeps increasing until it reaches its maximum limit. As it approaches that limit, GPU utilization plummets to a very low value and fluctuates constantly. When we kill some Python processes and resume training, that frees some RAM and GPU utilization becomes stable again. Please see the screenshots for further insight. My question is: what can we do to keep RAM within a safe limit so GPU utilization does not drop to a low value? Is there a code change that keeps it in an optimal range? If there are any other steps, please let us know, since right now we have to resume training again and again, freeing RAM by killing Python processes. Once the system starts using swap memory, GPU usage stays low as well.
Based on your description, it seems your script is increasing memory usage by e.g. storing tensors (or references to them) that are still attached to a computation graph.
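A common instance of this is appending per-iteration losses to a list without detaching them. A minimal sketch of the pattern and the fix (the model and tensor shapes here are made up for illustration, not taken from your script):

```python
import torch

model = torch.nn.Linear(10, 1)
losses = []
for step in range(3):
    x = torch.randn(4, 10)
    loss = model(x).mean()
    # Storing `loss` directly would keep its whole computation graph
    # (and all intermediate activations) alive, so host RAM grows
    # with every iteration. Detaching (or calling .item()) drops the
    # reference to the graph:
    losses.append(loss.detach().item())

print(losses)  # three plain Python floats, no graph attached
```

The same applies to accumulating metrics, predictions, or intermediate activations: call `.detach()` (and `.cpu()` if needed) before storing them.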
You didn’t post any code snippets, so I don’t know what exactly is going on in your script, but try to narrow down which part of the code is increasing the memory usage by adding debug print statements.
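As one way to add such debug statements, the stdlib `tracemalloc` module can report how much memory Python has allocated and which source lines allocated the most; the loop below is a stand-in for whatever code path you suspect:

```python
import tracemalloc

tracemalloc.start()

data = []
for _ in range(1000):
    data.append([0] * 1000)  # placeholder for the suspect code path

# current/peak bytes allocated by Python since tracemalloc.start()
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")

# show the source lines responsible for the largest allocations
for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
    print(stat)
```

Sprinkling `get_traced_memory()` prints between training stages (data loading, forward pass, logging, etc.) should reveal which stage grows between iterations. Note that `tracemalloc` only sees Python-level allocations; for total process RSS you could check e.g. `psutil.Process().memory_info().rss` if psutil is installed.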
I am using the yolov5 repository: yolov5/train.py at master · ultralytics/yolov5 · GitHub