CUBLAS_STATUS_NOT_SUPPORTED for BF16 (Cuda11.6, Pytorch)

Hi,
I just got the following error while training my PyTorch model with bfloat16 parameters:

File "/opt/conda/envs/XXX/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 768 n 2304 k 768 mat1_ld 768 mat2_ld 768 result_ld 768 abcType 14 computeType 68 scaleType 0

The types of input, self.weight, self.bias are all bfloat16 and the shapes are (9, 256, 768), (768, 768), (768, ), respectively.
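Since I cannot share the model itself, here is a minimal standalone sketch of the failing call, using random placeholder tensors with the same dtypes and shapes (it falls back to CPU so it runs anywhere; the error only occurs on the GPU in my environment):

```python
import torch
import torch.nn.functional as F

# Random placeholder tensors matching the dtypes/shapes of the failing call
# (the real model weights cannot be shared). Fall back to CPU so the snippet
# runs anywhere; the CUBLAS error only appears on CUDA in my setup.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(9, 256, 768, dtype=torch.bfloat16, device=device)
weight = torch.randn(768, 768, dtype=torch.bfloat16, device=device)
bias = torch.randn(768, dtype=torch.bfloat16, device=device)

out = F.linear(x, weight, bias)
print(out.shape, out.dtype)  # torch.Size([9, 256, 768]) torch.bfloat16
```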

My PyTorch version is 1.14.0.dev20221213+cu116, and my Python version is 3.8.15.
I am using 8 A100 GPUs (80 GB).

I ran "torch.cuda.is_bf16_supported()" and got "True".
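For completeness, a small sketch of the checks I ran (the comments note what my machine reports; values will differ on other hardware):

```python
import torch

print(torch.__version__)  # PyTorch build string, e.g. a nightly tag

if torch.cuda.is_available():
    print(torch.cuda.is_bf16_supported())       # True on my machine
    print(torch.cuda.get_device_capability(0))  # (8, 0) for an A100
    print(torch.version.cuda)                   # CUDA version PyTorch was built against
```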

Actually, I have tried other models with BF16 parameters and did not get this error. Unfortunately, I cannot share my model with you. Please let me know if you have any idea what causes the error. Thanks.

Also, I tried different PyTorch and CUDA versions:

PyTorch 1.12.1, CUDA 11.6

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())

PyTorch 1.12.1, CUDA 11.3

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

No error messages.

Could you post a minimal, executable code snippet, please? I cannot reproduce the issue using your posted shapes on a current master build on an A100.