GPU Utilization Tutorial/Troubleshooting

I’m running someone else’s model locally, and I noticed that, as written, GPU utilization is really low and training is slow. My initial reaction was to increase the batch size by two orders of magnitude; however, training was still slow and utilization stayed below 5%.

So, I have two questions:

  • Has anyone produced a good tutorial on how to improve GPU utilization? This seems like an important topic for anyone creating their own models.
  • Is there a set of “usual suspects” in model/dataset/dataloader code that, when overlooked, kill GPU performance?

This may be a naive answer, but did you check whether the model and the data were sent to the device?
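
Something like this, for reference (a minimal sketch with a stand-in model and dataset, not your actual code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# hypothetical stand-ins for the model and data in question
model = nn.Linear(128, 10).to(device)  # the model's parameters must live on the GPU
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:
    # each batch must also be moved; forgetting either .to(device) call
    # leaves the compute on the CPU and GPU utilization near zero
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```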

https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

Maybe this?
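
The DataLoader settings that guide calls out are also common culprits when the GPU sits idle waiting for data. A rough sketch, assuming the input pipeline is the bottleneck (the num_workers value here is just an illustrative guess; tune it for your machine):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,    # parallel worker processes keep the GPU fed
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)

for inputs, targets in loader:
    # non_blocking=True lets the copy overlap with compute when pin_memory is set
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
```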
