DDP crash after a fixed number of iterations

I use DDP for distributed training. After around 1022 iterations (changing the batch size results in a different but fixed crash iteration), the program crashes with the error log: "an illegal memory access was encountered". After I replaced all Linear layers with torch.matmul and manually created weight tensors, the error is still there, now showing:
"CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)"

Any idea why this happens?

I would recommend trying to narrow down the illegal memory access via cuda-gdb, compute-sanitizer, or by creating a CUDA coredump for further debugging. cuBLAS might just be the first victim of a sticky CUDA failure triggered earlier in your code.
In case you are using an older release, I would recommend updating to the latest stable or nightly release.
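A minimal sketch of how I would start, assuming a single-process repro (the env vars are documented CUDA/cuda-gdb settings; train.py is a placeholder for your script):

```python
import os

# Make kernel launches synchronous so the failing op shows up in the
# Python stack trace instead of a later, unrelated call. Must be set
# before CUDA is initialized, so do it before the first CUDA operation.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Ask the driver to write a CUDA coredump when the illegal access is hit;
# the dump can then be opened in cuda-gdb for inspection.
os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"

import torch  # import after setting the env vars

# ... run the training loop as usual ...

# Alternatively, run the whole script under compute-sanitizer to catch
# the first out-of-bounds access at its source:
#   compute-sanitizer --tool memcheck python train.py
```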