What is the minimal code that triggers "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`"?

I am looking for a minimal PyTorch snippet that triggers the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)
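There is probably no snippet that raises this error on every setup. On a healthy machine a float32 batched matmul simply succeeds, and the error often points at the environment (driver, CUDA/cuBLAS build, or the GPU itself) rather than at the matmul. As a sketch under that assumption, the code below only exercises the same code path: on typical CUDA builds a float32 `torch.bmm` is expected to dispatch to `cublasSgemmStridedBatched`, though that is not guaranteed for every PyTorch version. The function name and default shapes are hypothetical.

```python
import torch


def strided_batched_sgemm(batch=4, m=1000, k=3, n=3, device="cuda"):
    """Float32 batched matmul. On CUDA builds this is expected (not guaranteed)
    to go through cublasSgemmStridedBatched. Shapes are hypothetical."""
    a = torch.randn(batch, m, k, device=device, dtype=torch.float32)
    b = torch.randn(batch, k, n, device=device, dtype=torch.float32)
    c = torch.bmm(a, b)  # result shape: (batch, m, n)
    if c.is_cuda:
        # CUDA work runs asynchronously; synchronizing makes any pending
        # CUDA/cuBLAS error surface here instead of at a later, unrelated line.
        torch.cuda.synchronize()
    return c


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    out = strided_batched_sgemm(device=device)
    print(device, tuple(out.shape))
```

Running the same script on all three machines and seeing which ones fail would show whether the problem follows the machine rather than the code base.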

I am asking because I get this error on two of my three machines, and I am not sure how best to debug it in a large code base. Here is the related post: "Vertices=torch.matmul(vertices.unsqueeze(0), rotations_init), RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched in CentOS"
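To narrow it down in a large code base, one option is to isolate the call from the related post and compare the GPU result with a CPU reference. The shapes used here, `vertices` as (N, 3) and `rotations_init` as (B, 3, 3), are assumptions because the post title only shows the call, and `check_matmul_on_device` is a hypothetical helper.

```python
import torch


def check_matmul_on_device(vertices, rotations_init, device="cuda", atol=1e-4):
    """Replay the suspect matmul on `device` and compare it with a CPU reference.
    Assumed shapes: vertices (N, 3), rotations_init (B, 3, 3) -> result (B, N, 3)."""
    expected = torch.matmul(vertices.cpu().unsqueeze(0), rotations_init.cpu())
    got = torch.matmul(vertices.to(device).unsqueeze(0), rotations_init.to(device))
    if got.is_cuda:
        # Surface any pending CUDA/cuBLAS error at this exact call.
        torch.cuda.synchronize()
    # Loose tolerance: GPU and CPU float32 results can differ slightly.
    return torch.allclose(got.cpu(), expected, atol=atol)
```

One way to use it is to `torch.save` the two tensors right before the failing line on an affected machine, then `torch.load` them in a fresh script and pass them to the helper, so the call can be replayed without the rest of the code base.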

You could enable cuBLAS logging and check the logs for errors, or post the logs here.
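cuBLAS logging is controlled by environment variables: `CUBLAS_LOGINFO_DBG=1` turns it on and `CUBLAS_LOGDEST_DBG` picks the destination (`stdout`, `stderr`, or a file name). A minimal sketch, assuming the failing code can be started as a standalone Python script, sets them for a child process so they are already in place before cuBLAS is loaded. `cublas_logging_env` and `run_with_cublas_logging` are hypothetical helper names.

```python
import os
import subprocess
import sys


def cublas_logging_env(log_dest="cublas.log", base_env=None):
    """Return a copy of the environment with cuBLAS API logging enabled.
    log_dest may be "stdout", "stderr", or a file name."""
    env = dict(os.environ if base_env is None else base_env)  # copy, never mutate
    env["CUBLAS_LOGINFO_DBG"] = "1"      # enable cuBLAS logging
    env["CUBLAS_LOGDEST_DBG"] = log_dest  # where the log is written
    return env


def run_with_cublas_logging(script_path, log_dest="cublas.log"):
    """Run a Python script in a child process that inherits the logging variables."""
    return subprocess.run(
        [sys.executable, script_path],
        env=cublas_logging_env(log_dest),
        check=False,  # keep going on failure so the exit code can be inspected
    )
```

For example, `run_with_cublas_logging("train.py")` would write the cuBLAS calls made by a hypothetical `train.py` to `cublas.log`; the entries logged just before the failure are the ones worth checking or posting.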