That’s not exactly how GPUs work. Once all of a kernel’s blocks have been issued, there is a “tail” of blocks that only partially occupies the GPU. If the kernel runs for just a few waves of blocks, the tail can be a significant fraction of the total kernel execution time. Another kernel launched on a second stream could fill the device during the tail of the first kernel, increasing device utilization.
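For example, here is a minimal PyTorch sketch of the idea (the sizes are made up for illustration, and whether the second kernel actually overlaps the tail depends on the kernels and the hardware):

```python
import torch

a = torch.randn(8192, 8192, device="cuda")   # many blocks -> a noticeable tail as it drains
b = torch.randn(1024, 1024, device="cuda")   # small, independent piece of work

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

with torch.cuda.stream(s1):
    c = a @ a          # fills the device, then drains into a partially occupied tail

with torch.cuda.stream(s2):
    d = b @ b          # eligible to run concurrently, e.g. in the tail of the matmul on s1

torch.cuda.synchronize()
```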
The issue for PyTorch would be that kernels with very short execution times run into the limit of how fast you can launch kernels from the CPU (let alone from a Python application).
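A rough way to see the launch-bound regime from Python (just a sketch; the absolute numbers vary a lot between machines):

```python
import time
import torch

x = torch.randn(256, device="cuda")     # tiny tensor, so each kernel finishes almost instantly

torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(10_000):
    x = x * 1.0001                       # one tiny kernel launch per iteration
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
print(f"~{elapsed / 10_000 * 1e6:.1f} us per op, dominated by launch overhead rather than compute")
```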
CUDA Graphs were introduced to move kernel launch to the device side, reducing launch overhead for short-running kernels.
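For concreteness, a minimal sketch of capture and replay with the torch.cuda.CUDAGraph API that recent PyTorch releases expose (tensor sizes are placeholders; inputs are fed by copying into the static buffer before each replay):

```python
import torch

x = torch.randn(256, device="cuda")      # static input buffer (placeholder size)

# Warm up on a side stream before capture, as the capture machinery expects.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = x * 1.0001
torch.cuda.current_stream().wait_stream(s)

# Capture the short-running work into a graph once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x * 1.0001

# ...then replay it with a single launch per call.
x.copy_(torch.randn(256, device="cuda"))  # refresh the static input in place
g.replay()
torch.cuda.synchronize()
```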
Apparently there is an effort underway to use CUDA Graphs in PyTorch,
… but it seems to have hit a snag:
Anyway, as these issues are very interesting to me, I would like to learn more about what the roadmap looks like for this kind of functionality in PyTorch.