I use DDP for distributed training. After around 1022 iterations (changing the batch size results in a different, but still fixed, crash iteration), the program crashes with the error "an illegal memory access was encountered". After I replaced all Linear layers with torch.matmul on manually created weight tensors, the error is still there, and it now shows:
" CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling
cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)"
Any idea why this happens?
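For reference, the Linear-to-matmul replacement I tried looks roughly like this (a minimal sketch; the shapes here are illustrative, not from my actual model):

```python
import torch

# Illustrative sizes, not the real model's dimensions
in_features, out_features, batch = 64, 32, 8

# Manually created weight/bias with the same layout nn.Linear uses:
# weight is (out_features, in_features), bias is (out_features,)
w = torch.randn(out_features, in_features, requires_grad=True)
b = torch.zeros(out_features, requires_grad=True)

x = torch.randn(batch, in_features)

# Equivalent of nn.Linear's forward: x @ w.T + b.
# This path still hits the same cublasSgemm failure for me.
y = torch.matmul(x, w.t()) + b
print(y.shape)  # torch.Size([8, 32])
```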