How to properly run CUDA ops asynchronously across multiple streams in PyTorch?

  1. Rerun your code in a loop to remove the initialization/warmup artifacts and you will see the overlap.
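A minimal sketch of that suggestion (the `timed_overlap` helper name and sizes are my own, assuming a CUDA device is available) — loop the two-stream workload so the first-iteration init cost no longer hides the overlap:

```python
import torch

def timed_overlap(iters=10, n=256):
    # Returns None when no CUDA device is present.
    if not torch.cuda.is_available():
        return None
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    # Make sure the allocations on the default stream are done before
    # other streams start reading them.
    torch.cuda.synchronize()
    # The first iterations pay for CUDA context init, cuBLAS handle
    # creation, etc.; later iterations show the real overlap in a profile.
    for _ in range(iters):
        with torch.cuda.stream(s1):
            c = a @ a
        with torch.cuda.stream(s2):
            d = b @ b
    torch.cuda.synchronize()
    return c, d
```

Profiling the later iterations (e.g. with Nsight Systems or the PyTorch profiler) should then show the two matmuls running concurrently on the two streams.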

  1. Yes, you have a race condition in the matmul call, which can also be seen via:
```
TORCH_CUDA_SANITIZER=1 python main.py
============================
CSAN detected a possible data race on tensor with data pointer 140543073976320
Access by stream 93955755406240 during kernel:
aten::mm(Tensor self, Tensor mat2) -> Tensor
reading from argument(s) self
With stack trace:
...
```
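One way to avoid a race like the one CSAN flags above is to make the consuming stream wait on the producing stream before the matmul reads the tensor. A minimal sketch (the `safe_mm` helper and tensor sizes are my own, assuming a CUDA device):

```python
import torch

def safe_mm(n=256):
    # Returns None when no CUDA device is present.
    if not torch.cuda.is_available():
        return None
    producer = torch.cuda.Stream()
    with torch.cuda.stream(producer):
        x = torch.randn(n, n, device="cuda")
        y = x * 2  # work on `producer` that writes the matmul input
    # The default stream must wait until `producer` finished writing `y`,
    # otherwise aten::mm may read `y` while it is still being written.
    torch.cuda.current_stream().wait_stream(producer)
    out = torch.mm(y, y)
    # Tell the caching allocator that `y` is now also used on this stream,
    # so its memory is not recycled too early.
    y.record_stream(torch.cuda.current_stream())
    torch.cuda.synchronize()
    return out
```

`torch.cuda.Stream.wait_stream` (or recording and waiting on a `torch.cuda.Event`) is the standard way to order work across streams; after the fix, rerunning under `TORCH_CUDA_SANITIZER=1` should no longer report the race.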