When moving from cuda-11.3 to cuda-11.6+, a call to torch.nn.functional.linear() began to fail with a CUBLAS_STATUS_NOT_SUPPORTED error. I was able to reproduce the error with the following script, which places one of the input tensors passed to linear() on a 2-byte (“torch.half”) boundary.
import torch
import torch.nn.functional as F
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

pad = torch.rand((1), requires_grad=True, dtype=torch.half, device="cuda")  # one fp16 element = 2 bytes
A = torch.rand((5120, 2560), requires_grad=True, dtype=torch.half, device="cuda")
all_tensors = [pad, A]  # pad comes first, so A lands one element into the flat buffer
new_tensors = _unflatten_dense_tensors(_flatten_dense_tensors([p.clone().detach() for p in all_tensors]), all_tensors)
pad, A = new_tensors  # A is now a view at a 2-byte offset into the flat buffer

X = torch.rand((26, 1, 2560), requires_grad=True, dtype=torch.half, device="cuda")
B = torch.rand((5120), requires_grad=True, dtype=torch.half, device="cuda")
out = F.linear(X, A, B)
print(out)
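To confirm the offset without relying on the cuBLASLt log, a check along these lines can be run. This is only a sketch: `alignment_of` is a hypothetical helper, the shapes are shrunk for illustration, and the script falls back to CPU when no GPU is present, since the offset logic is the same on either device.

```python
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors


def alignment_of(t: torch.Tensor) -> int:
    """Largest power of two (in bytes) that divides the tensor's data pointer."""
    addr = t.data_ptr()
    return addr & -addr  # isolates the lowest set bit of the address


# Fall back to CPU so the check also runs on a machine without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Shapes are shrunk for illustration; only the one-element pad matters here.
pad = torch.zeros(1, dtype=torch.half, device=device)
A = torch.zeros((8, 4), dtype=torch.half, device=device)

# Same flatten/unflatten round trip as in the repro script.
flat = _flatten_dense_tensors([pad, A])
pad_view, A_view = _unflatten_dense_tensors(flat, [pad, A])

print("storage offset of A (elements):", A_view.storage_offset())
print("byte offset of A from buffer start:", A_view.data_ptr() - flat.data_ptr())
print("alignment of A (bytes):", alignment_of(A_view))
```

The view of A starts one fp16 element after the start of the flat buffer, so its address is the buffer's base address plus 2 bytes.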
The following trace was produced with nightly pytorch built against cuda-11.6, but from what I can tell the problem affects pytorch-1.12+ with cuda-11.6+. I ran the test script above as: CUBLASLT_LOG_LEVEL=5 python test.py
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatmulDescCreate] matmulDesc=0X7FFC66EBCDD8 computeType=COMPUTE_32F scaleType=0
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatmulDescSetAttribute] matmulDesc=0X55C8AC9AC430 attr=MATMUL_DESC_TRANSA buf=0X7FFC66EBCDB8 sizeInBytes=4
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatmulDescSetAttribute] matmulDesc=0X55C8AC9AC430 attr=MATMUL_DESC_TRANSB buf=0X7FFC66EBCDBC sizeInBytes=4
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatmulDescSetAttribute] matmulDesc=0X55C8AC9AC430 attr=MATMUL_DESC_EPILOGUE buf=0X7FFC66EBCDC0 sizeInBytes=4
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatmulDescSetAttribute] matmulDesc=0X55C8AC9AC430 attr=MATMUL_DESC_BIAS_POINTER buf=0X7FFC66EBCFB8 sizeInBytes=8
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatrixLayoutCreate] matLayout=0X7FFC66EBCDD8 type=R_16F rows=2560 cols=5120 ld=2560
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatrixLayoutCreate] matLayout=0X7FFC66EBCDD8 type=R_16F rows=2560 cols=26 ld=2560
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatrixLayoutCreate] matLayout=0X7FFC66EBCDD8 type=R_16F rows=5120 cols=26 ld=5120
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatmulPreferenceCreate] matmulPref=0X7FFC66EBCDD8
[2023-01-13 18:46:05][cublasLt][265705][Api][cublasLtMatmulPreferenceSetAttribute] pref=0X55C8AC9ADF00 attr=MATMUL_PREF_MAX_WORKSPACE_BYTES buf=0X7FFC66EBCDC8 sizeInBytes=8
[2023-01-13 18:46:06][cublasLt][265705][Api][cublasLtMatmulAlgoGetHeuristic] Adesc=[type=R_16F rows=2560 cols=5120 ld=2560] Bdesc=[type=R_16F rows=2560 cols=26 ld=2560] Cdesc=[type=R_16F rows=5120 cols=26 ld=5120] preference=[maxWavesCount=0.0 maxWorkspaceSizeinBytes=1048576] computeDesc=[computeType=COMPUTE_32F scaleType=R_32F transa=OP_T epilogue=EPILOGUE_BIAS biasPointer=0x7fe48bc20a00]
[2023-01-13 18:46:06][cublasLt][265705][Info][cublasLtMatmulAlgoGetHeuristic] heuristicResults=[6]
[2023-01-13 18:46:06][cublasLt][265705][Api][cublasLtMatmul] A=0X7FE452000002 Adesc=0X55C8AC9AD160 B=0X7FE48BC00200 Bdesc=0X55C8AC9AD5F0 C=0X7FE48BC23200 Cdesc=0X55C8AC9AD630 D=0X7FE48BC23200 Ddesc=0X55C8AC9AD630 computeDesc=0X55C8AC9AC430 algo=0X7FFC66EBCE00 workSpace=0X7FE48BC64200 workSpaceSizeInBytes=1048576 stream=0X0
[2023-01-13 18:46:06][cublasLt][265705][Trace][cublasLtMatmul] A=0X7FE452000002 Adesc=[type=R_16F rows=2560 cols=5120 ld=2560] B=0X7FE48BC00200 Bdesc=[type=R_16F rows=2560 cols=26 ld=2560] C=0X7FE48BC23200 Cdesc=[type=R_16F rows=5120 cols=26 ld=5120] D=0X7FE48BC23200 Ddesc=[type=R_16F rows=5120 cols=26 ld=5120] computeDesc=[computeType=COMPUTE_32F scaleType=R_32F transa=OP_T epilogue=EPILOGUE_BIAS biasPointer=0x7fe48bc20a00] algo=[algoId=6 tile=MATMUL_TILE_64x64 stages=MATMUL_STAGES_64x6] workSpace=0X7FE48BC64200 workSpaceSizeInBytes=1048576 beta=0 outOfPlace=0 stream=0X0
[2023-01-13 18:46:06][cublasLt][265705][Api][cublasLtMatmulPreferenceDestroy] matmulPref=0X55C8AC9ADF00
[2023-01-13 18:46:06][cublasLt][265705][Api][cublasLtMatrixLayoutDestroy] matLayout=0X55C8AC9AD630
[2023-01-13 18:46:06][cublasLt][265705][Api][cublasLtMatrixLayoutDestroy] matLayout=0X55C8AC9AD5F0
[2023-01-13 18:46:06][cublasLt][265705][Api][cublasLtMatrixLayoutDestroy] matLayout=0X55C8AC9AD160
[2023-01-13 18:46:06][cublasLt][265705][Api][cublasLtMatmulDescDestroy] matmulDesc=0X55C8AC9AC430
Traceback (most recent call last):
  File "/home/ubuntu/src/augment/models/gpt-neox/mytest.py", line 13, in <module>
    out = F.linear(X, A, B)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 5120 n 26 k 2560 mat1_ld 2560 mat2_ld 2560 result_ld 5120 abcType 2 computeType 68 scaleType 0
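The pointer values in the cublasLtMatmul trace line can be checked for alignment directly. The sketch below copies the addresses from the log above and reports the largest power of two that divides each one; `pointer_alignment` is a hypothetical helper name, not part of any library.

```python
def pointer_alignment(addr: int) -> int:
    """Largest power of two (in bytes) that divides the address."""
    return addr & -addr  # lowest set bit of the address


# Addresses copied from the cublasLtMatmul trace line in the log.
pointers = {
    "A": 0x7FE452000002,
    "B": 0x7FE48BC00200,
    "C/D": 0x7FE48BC23200,
    "bias": 0x7FE48BC20A00,
    "workspace": 0x7FE48BC64200,
}

alignments = {name: pointer_alignment(addr) for name, addr in pointers.items()}
for name, nbytes in alignments.items():
    print(f"{name}: {nbytes}-byte aligned")
```

Only A comes out 2-byte aligned; B, C/D, the bias pointer, and the workspace are all 512-byte aligned.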
To summarize, it appears that cublasLtMatmul rejects a matrix that is only 2-byte aligned when a bias epilogue is included and the other two matrices are aligned on a larger boundary.
It seems the heuristic chosen to satisfy this matmul may be incorrect, and I can’t tell whether the error is in how pytorch asks for the heuristic or in cuda-11.6 choosing the wrong heuristic.
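One experiment that might help separate the alignment hypothesis from other causes is to copy the misaligned weight into a fresh allocation before calling linear(). The sketch below is an untested suggestion, not a confirmed workaround: it assumes that clone() returns a tensor with its own storage starting at offset 0, and it only runs the linear() call when a GPU is available.

```python
import torch
import torch.nn.functional as F
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

# Fall back to CPU so the offset check still runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

pad = torch.ones(1, dtype=torch.half, device=device)
A = torch.ones((5120, 2560), dtype=torch.half, device=device)
_, A_view = _unflatten_dense_tensors(_flatten_dense_tensors([pad, A]), [pad, A])

# clone() copies the view into a new allocation, so the copy starts at offset 0.
A_copy = A_view.clone()
print("storage offsets (view, copy):", A_view.storage_offset(), A_copy.storage_offset())

if device == "cuda":
    X = torch.rand((26, 1, 2560), dtype=torch.half, device=device)
    B = torch.rand(5120, dtype=torch.half, device=device)
    # Same call as in the repro script, but with the re-allocated weight.
    out = F.linear(X, A_copy, B)
    print(out.shape)
```

If the call succeeds with A_copy but fails with A_view, that would point at the 2-byte alignment of A rather than at the shapes or the bias epilogue.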
I’m happy to provide additional information or reproductions if needed. Thanks!