This post and the GTC presentation linked in my other post in the same thread might be interesting.
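TL;DR: launch the kernels on separate CUDA streams and make sure compute resources are still free for the second kernel. Matmuls tend to have high occupancy, so a single matmul can already fill the GPU and leave nothing for another kernel to overlap with.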
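As a rough sketch of the stream part, assuming you are in PyTorch (the thread does not say which framework is in use, so treat that as my assumption): the helper name `overlapped_matmuls` and the size `n` are made up for illustration. It queues two independent matmuls on two side streams and falls back to plain sequential execution when no GPU is present.

```python
import torch


def overlapped_matmuls(n=256):
    """Queue two independent matmuls on separate CUDA streams (hypothetical helper)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    if device == "cpu":
        # No GPU available: nothing to overlap, just run sequentially.
        return a @ a, b @ b

    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()

    # The inputs were created on the current stream, so both side streams
    # have to wait for that work before reading them.
    current = torch.cuda.current_stream()
    s1.wait_stream(current)
    s2.wait_stream(current)

    with torch.cuda.stream(s1):
        out1 = a @ a  # queued on s1
    with torch.cuda.stream(s2):
        out2 = b @ b  # queued on s2, may run concurrently with s1

    # Wait for both streams before the results are used elsewhere.
    torch.cuda.synchronize()
    return out1, out2
```

Using separate streams only allows the kernels to overlap, it does not force it. With small `n` there is a chance both kernels fit on the GPU at once; with large matrices the first matmul will likely occupy all SMs and the second one will simply run after it. A profiler timeline is the way to check what actually happened.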