GPU Utilization Tutorial/Troubleshooting

I’m running someone else’s model locally, and I noticed that, as written, GPU utilization is really low and training is slow. My initial reaction was to increase the batch size by two orders of magnitude; however, training was still slow and utilization stayed below 5%.

So, I have two questions:

  • Has anyone produced a good tutorial on how to improve GPU utilization? This seems like an important topic for anyone creating their own models.
  • Is there a set of “usual suspects” in model/dataset/dataloader code that, when overlooked, kill GPU performance?

This may be a naive answer, but did you check whether the model and the data were sent to the device?
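
Something like this, for reference (a minimal sketch with a stand-in model and dataset, not your actual code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# hypothetical stand-ins for the model and data in question
model = nn.Linear(128, 10).to(device)  # the model's parameters must live on the GPU
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:
    # each batch must also be moved; forgetting either .to(device) call
    # leaves the compute on the CPU and GPU utilization near zero
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```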

https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

Maybe this?
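
The DataLoader settings that guide calls out are also common culprits when the GPU sits idle waiting for data. A rough sketch, assuming the input pipeline is the bottleneck (the num_workers value here is just an illustrative guess; tune it for your machine):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,    # parallel worker processes keep the GPU fed
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)

for inputs, targets in loader:
    # non_blocking=True lets the copy overlap with compute when pin_memory is set
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
```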
