Yes, your code has a race condition in the matmul call, which the CUDA Sanitizer (CSAN) also reports when you run:
TORCH_CUDA_SANITIZER=1 python main.py
============================
CSAN detected a possible data race on tensor with data pointer 140543073976320
Access by stream 93955755406240 during kernel:
aten::mm(Tensor self, Tensor mat2) -> Tensor
reading from argument(s) self
With stack trace:
...
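
Since main.py is not shown here, the following is only a minimal sketch of how such a race is commonly avoided, assuming the matmul runs on a side stream while its inputs were produced on another stream. The helper name matmul_on_side_stream and the tensor shapes are hypothetical and not taken from the original script.

```python
import torch


def matmul_on_side_stream(a, b):
    """Run torch.mm(a, b) on a side stream, ordered after the producer stream."""
    producer = torch.cuda.current_stream()
    side = torch.cuda.Stream()

    # Order the side stream after all work already queued on the producer
    # stream, so the kernels writing a and b finish before mm reads them.
    side.wait_stream(producer)

    with torch.cuda.stream(side):
        out = torch.mm(a, b)

    # Tell the caching allocator that a and b are in use on the side stream,
    # so their memory is not reused before the matmul has finished.
    a.record_stream(side)
    b.record_stream(side)

    # Order the producer stream after the matmul before anyone consumes out,
    # and record that out (allocated on the side stream) is used there.
    producer.wait_stream(side)
    out.record_stream(producer)
    return out


if torch.cuda.is_available():
    # Hypothetical inputs; the real script's tensors are not known.
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    out = matmul_on_side_stream(a, b)
    torch.cuda.synchronize()
    print(out.shape)
```

If the race in your script has this shape, rerunning with TORCH_CUDA_SANITIZER=1 after adding the wait_stream calls is a reasonable way to check whether the report goes away.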