Fail to make use of the Pytorch GPU asynchronous operation

ptrblck · August 2, 2022, 3:32pm

This post and the GTC presentation linked in my other post in the same thread might be interesting.
TL;DR: use streams and make sure compute resources are available. Matmuls tend to have a high occupancy, so you might not be able to overlap them.