CUDA error in torch.linalg.lstsq with large input tensor sizes

Hi,

I’ve been using torch.linalg.lstsq recently, but a CUDA runtime error occurs on the GPU when the input is relatively large. I am not sure whether the error is really due to the large matrix size or due to the CUDA/torch versions, since a CPU tensor of the same size produces no error.

The error is

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasStrsmBatched( handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)`

A minimal working example:

import torch
a = torch.rand(1, 3, 2)
b = torch.rand(1, 3, 2500*2500) # 1 x 3 x 6250000
# no error
torch.linalg.lstsq(a, b)
# error
torch.linalg.lstsq(a.cuda(), b.cuda())
# no error if making b smaller
torch.linalg.lstsq(a.cuda(), b[:, :, :65536].cuda())
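Since each column of b is an independent least-squares problem, one possible workaround (my own sketch, not something verified against the underlying cuBLAS limit) is to split b into column blocks at or below the 65536-column threshold observed above and concatenate the per-block solutions:

```python
import torch

def lstsq_chunked(a, b, max_cols=65536):
    # Workaround sketch: every column of b is an independent least-squares
    # problem, so solving column blocks separately and concatenating the
    # solutions is mathematically identical to one large solve.
    solutions = [torch.linalg.lstsq(a, chunk).solution
                 for chunk in b.split(max_cols, dim=-1)]
    return torch.cat(solutions, dim=-1)

# Sanity check on CPU with small sizes: chunked result matches the full solve.
a = torch.rand(1, 3, 2)
b = torch.rand(1, 3, 1000)
full = torch.linalg.lstsq(a, b).solution
assert torch.allclose(full, lstsq_chunked(a, b, max_cols=300), atol=1e-5)
```

On GPU, `lstsq_chunked(a.cuda(), b.cuda())` would issue several smaller `cublasStrsmBatched` calls instead of one oversized one, assuming the failure is indeed tied to the number of columns.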

Below is the environment info from torch.utils.collect_env.

Collecting environment information…
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: (Rev5, Built by MSYS2 project) 5.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 8 2023, 10:42:25) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA TITAN RTX
GPU 1: NVIDIA TITAN RTX

Nvidia driver version: 536.23
[conda] mkl_random 1.2.2 py310h4ed8f06_0
[conda] numpy 1.23.5 py310h60c9a35_0
[conda] numpy-base 1.23.5 py310h04254f7_0
[conda] pytorch 1.13.1 py3.10_cuda11.7_cudnn8_0 pytorch
[conda] pytorch-cuda 11.7 h67b0de4_1 pytorch
[conda] pytorch-lightning 2.0.3 py310haa95532_0
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.13.1 pypi_0 pypi
[conda] torchmetrics 1.0.3 pyhd8ed1ab_0 conda-forge
[conda] torchnet 0.0.4 pypi_0 pypi
[conda] torchvision 0.15.2 cpu_py310h7187fe4_0

Thanks,

Hi M!

I can reproduce your issue (with a slightly different error message) with the
current stable pytorch and a rather old gpu.

Here is my test script:

import torch
print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_properties (0))

_ = torch.manual_seed (2024)

a = torch.rand(1, 3, 2)
b = torch.rand(1, 3, 2500*2500) # 1 x 3 x 6250000
torch.linalg.lstsq(a.cuda(), b.cuda())

And here is its output:

2.1.2
11.8
_CudaDeviceProperties(name='GeForce GTX 1050 Ti', major=6, minor=1, total_memory=4040MB, multi_processor_count=6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 14, in <module>
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasStrsmBatched( handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)`

It might make sense to log this as a github issue.

@ptrblck: I see a number of similar issues in github that seem to be being
taken seriously, but not this exact error.

Best.

K. Frank

Hi Frank,

Thanks for the additional information. I submitted the ticket here and hopefully there will be some updates.

Best,

As a fix, please update PyTorch to the latest release with CUDA 12.1.
We’ll continue the discussion in the linked GitHub issue.

@KFrank thanks for pinging me on this topic!

Thanks for the reply.

I created another environment with pytorch 2.1.2 and cuda 12.1 on Ubuntu, but somehow the problem persists; I now get the same error as Frank’s:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasStrsmBatched( handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)`

Environment:

PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA TITAN RTX
GPU 1: NVIDIA TITAN RTX

Nvidia driver version: 535.146.02
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      43 bits physical, 48 bits virtual
CPU(s):                             16
On-line CPU(s) list:                0-15
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              113
Model name:                         AMD Ryzen 7 3800X 8-Core Processor
Stepping:                           0
Frequency boost:                    enabled
CPU MHz:                            4101.621
CPU max MHz:                        4100.0000
CPU min MHz:                        2200.0000
BogoMIPS:                           7800.37
Virtualization:                     AMD-V
L1d cache:                          256 KiB
L1i cache:                          256 KiB
L2 cache:                           4 MiB
L3 cache:                           32 MiB
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] pytorch-lightning==2.1.3
[pip3] torch==2.1.2
[pip3] torchaudio==2.1.2
[pip3] torchmetrics==1.1.2
[pip3] torchnet==0.0.4
[pip3] torchvision==0.16.2
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl    anaconda
[conda] libopenvino-pytorch-frontend 2023.2.0             h59595ed_4    conda-forge
[conda] mkl                       2023.1.0         h213fc3f_46344    anaconda
[conda] mkl-service               2.4.0           py310h5eee18b_1    anaconda
[conda] mkl_fft                   1.3.8           py310h5eee18b_0    anaconda
[conda] mkl_random                1.2.4           py310hdb19cb5_0    anaconda
[conda] numpy                     1.26.3          py310h5f9d8c6_0  
[conda] numpy-base                1.26.3          py310hb5e798b_0  
[conda] pytorch                   2.1.2           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-lightning         2.1.3              pyhd8ed1ab_0    conda-forge
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.1.2               py310_cu121    pytorch
[conda] torchmetrics              1.1.2           py310h06a4308_0  
[conda] torchtriton               2.1.0                     py310    pytorch
[conda] torchvision               0.16.2              py310_cu121    pytorch
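Until the upstream fix lands, a CPU fallback is another stopgap, since the thread shows the same sizes succeed on CPU. A sketch of such a helper (my own, not part of any PyTorch API):

```python
import torch

def lstsq_with_cpu_fallback(a, b):
    # Stopgap sketch: try the solve on the tensors' current device; if the
    # call fails with a RuntimeError (as the cuBLAS error in this thread
    # does), redo the solve on CPU and move the result back.
    try:
        return torch.linalg.lstsq(a, b).solution
    except RuntimeError:
        return torch.linalg.lstsq(a.cpu(), b.cpu()).solution.to(a.device)
```

One caveat: some CUDA failures are sticky and leave the context unusable for later kernels, so this fallback only helps if the cuBLAS status error is raised cleanly, as it appears to be here.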