Hi, thanks for the amazing work. I am using an RTX 3090 Ti GPU and training on 1M images. GPU utilization fluctuates and training is not as fast as it should be. When we trained on a small dataset of 8k images, GPU utilization was stable; with 1M images, however, it constantly fluctuates between 0 and 98% and training is slow. We have varied the batch size and the number of workers. My system has the following specs:
pytorch: 1.12.1 stable
Could you please suggest some promising directions to fix this issue?
Make sure to share a minimal reproducible script so people can debug your issue (as opposed to just describing your script)!
I am using YOLOv5; the code is available in the yolov5 repository: yolov5/train.py at master · ultralytics/yolov5 · GitHub
“It is constantly fluctuating between 0 - 98% and slow”: maybe you are reading data too slowly, so your GPU is waiting for new data most of the time?
Try measuring the time your GPU needs to process a batch versus the time your dataloader needs to produce a new batch.
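A minimal sketch of that measurement, assuming a generic PyTorch training loop (the `DummyDataset` and the multiply-and-sum "forward pass" are placeholders for your actual dataset and model):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """Placeholder dataset; substitute your real image dataset here."""
    def __len__(self):
        return 256
    def __getitem__(self, idx):
        return torch.randn(3, 64, 64), 0

def profile_loader(loader, device, num_batches=20):
    """Return (avg seconds waiting for data, avg seconds of GPU compute) per batch."""
    data_time, gpu_time = 0.0, 0.0
    end = time.perf_counter()
    for i, (images, _) in enumerate(loader):
        if i >= num_batches:
            break
        data_time += time.perf_counter() - end   # time spent waiting on the dataloader
        start = time.perf_counter()
        images = images.to(device, non_blocking=True)
        _ = (images * 2).sum()                   # stand-in for model forward/backward
        if device.type == "cuda":
            torch.cuda.synchronize()             # CUDA is async; wait before timing
        gpu_time += time.perf_counter() - start
        end = time.perf_counter()
    n = min(num_batches, i + 1)
    return data_time / n, gpu_time / n

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Increase num_workers (and try pin_memory=True) and re-run to compare.
loader = DataLoader(DummyDataset(), batch_size=16, num_workers=0)
data_avg, gpu_avg = profile_loader(loader, device)
print(f"avg data wait: {data_avg:.4f}s  avg compute: {gpu_avg:.4f}s")
```

If the data-wait time dominates the compute time, the input pipeline is the bottleneck, which matches the 0–98% utilization swings you describe.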
If you have 1M images, maybe you should use a format adapted for big image datasets, like HDF5 files.
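A rough sketch of that idea using `h5py`, packing all images into one file so the loader avoids a million small file opens. The file name, dataset key, and image shapes are illustrative, and the lazy file handle is there because each dataloader worker should open its own handle:

```python
import h5py
import numpy as np
from torch.utils.data import Dataset

def build_h5(image_arrays, out_path="images.h5"):
    """Write a list of same-shaped numpy image arrays into a single HDF5 file."""
    with h5py.File(out_path, "w") as f:
        f.create_dataset(
            "images",
            data=np.stack(image_arrays),
            chunks=(1,) + image_arrays[0].shape,  # one chunk per image for random access
            compression="lzf",                    # fast, lightweight compression
        )

class H5ImageDataset(Dataset):
    """Read images back from the HDF5 file, one per __getitem__ call."""
    def __init__(self, path="images.h5"):
        self.path = path
        self.file = None  # opened lazily so each dataloader worker gets its own handle
        with h5py.File(path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        return self.file["images"][idx]
```

Alternatives in the same spirit include LMDB or WebDataset-style tar shards; the common point is turning many tiny random reads into fewer large sequential ones.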