How to properly run CUDA ops asynchronously across multiple streams in PyTorch?

  1. Rerun your code in a loop to remove the initialization/warmup artifacts and you will see the overlap.
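A minimal sketch of that suggestion (the `timed_overlap` helper name and sizes are my own, assuming a CUDA device is available) — loop the two-stream workload so the first-iteration init cost no longer hides the overlap:

```python
import torch

def timed_overlap(iters=10, n=256):
    # Returns None when no CUDA device is present.
    if not torch.cuda.is_available():
        return None
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    # Make sure the allocations on the default stream are done before
    # other streams start reading them.
    torch.cuda.synchronize()
    # The first iterations pay for CUDA context init, cuBLAS handle
    # creation, etc.; later iterations show the real overlap in a profile.
    for _ in range(iters):
        with torch.cuda.stream(s1):
            c = a @ a
        with torch.cuda.stream(s2):
            d = b @ b
    torch.cuda.synchronize()
    return c, d
```

Profiling the later iterations (e.g. with Nsight Systems or the PyTorch profiler) should then show the two matmuls running concurrently on the two streams.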

  1. Yes, you have a race condition in the matmul call, which can also be seen via:
```
TORCH_CUDA_SANITIZER=1 python main.py
============================
CSAN detected a possible data race on tensor with data pointer 140543073976320
Access by stream 93955755406240 during kernel:
aten::mm(Tensor self, Tensor mat2) -> Tensor
reading from argument(s) self
With stack trace:
...
```
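One way to avoid a race like the one CSAN flags above is to make the consuming stream wait on the producing stream before the matmul reads the tensor. A minimal sketch (the `safe_mm` helper and tensor sizes are my own, assuming a CUDA device):

```python
import torch

def safe_mm(n=256):
    # Returns None when no CUDA device is present.
    if not torch.cuda.is_available():
        return None
    producer = torch.cuda.Stream()
    with torch.cuda.stream(producer):
        x = torch.randn(n, n, device="cuda")
        y = x * 2  # work on `producer` that writes the matmul input
    # The default stream must wait until `producer` finished writing `y`,
    # otherwise aten::mm may read `y` while it is still being written.
    torch.cuda.current_stream().wait_stream(producer)
    out = torch.mm(y, y)
    # Tell the caching allocator that `y` is now also used on this stream,
    # so its memory is not recycled too early.
    y.record_stream(torch.cuda.current_stream())
    torch.cuda.synchronize()
    return out
```

`torch.cuda.Stream.wait_stream` (or recording and waiting on a `torch.cuda.Event`) is the standard way to order work across streams; after the fix, rerunning under `TORCH_CUDA_SANITIZER=1` should no longer report the race.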