CUDA tensor values become insanely large after a CUDA error occurred

I’ve encountered a weird problem while running Hugging Face Transformers’ BART model. My code ran just fine until it hit a particular batch of data, at which point this error occurred:

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

After that, any tensor I move to the GPU ends up with insanely large values.
For example, if I create a `torch.arange(1, 16, 1)` tensor and move it onto CUDA, the printed values become insanely large. What’s more, those values change each time I print the tensor.

It seems that some kind of memory corruption has occurred on the GPU, but what exactly could it be?

Has anybody come across the same problem? I have been struggling with this for quite a while and would appreciate any help.

p.s.
The full traceback is as follows:

  File "my_conda_env_path/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "my_conda_env_path/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "my_conda_env_path/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py", line 192, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "my_conda_env_path/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "my_conda_env_path/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py", line 331, in forward
    hidden_states, attn_weights, _ = self.self_attn(
  File "my_conda_env_path/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "my_conda_env_path/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py", line 856, in forward
    layer_outputs = encoder_layer(
  File "my_conda_env_path/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "my_conda_env_path/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py", line 1237, in forward
    encoder_outputs = self.encoder(
  File "my_conda_env_path/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "my_conda_env_path/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py", line 1373, in forward
    outputs = self.model(
  File "demos.py", line 43, in main
    model_res = model.forward(**batch.to(device), output_hidden_states=True)
  File "my_conda_env_path/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "demos.py", line 81, in <module>
    main()

The batch of data that triggered the problem is as follows:

{'input_ids': tensor([[    0, 41552, 45692,  ...,     1,     1,     1],
        [    0, 41552, 45692,  ...,     1,     1,     1],
        [    0, 41552, 45692,  ...,     1,     1,     1],
        ...,
        [    0, 41552, 45692,  ..., 15698, 50264,     2],
        [    0, 41552, 45692,  ...,     1,     1,     1],
        [    0, 41552, 45692,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[    0, 41552, 45692,  ...,  -100,  -100,  -100],
        [    0, 41552, 45692,  ...,  -100,  -100,  -100],
        [    0, 41552, 45692,  ...,  -100,  -100,  -100],
        ...,
        [    0, 41552, 45692,  ...,   442,   479,     2],
        [    0, 41552, 45692,  ...,  -100,  -100,  -100],
        [    0, 41552, 45692,  ...,  -100,  -100,  -100]])}

To the best of my knowledge, none of the input_ids or labels contain indices outside my model’s embedding range (except for -100, which is the ignore index used in labels).
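
To verify this, something along these lines could be run on the CPU side before moving the batch to the GPU (an illustrative sketch rather than the exact code from my script; `model` and `batch` refer to the objects in my script):

# Illustrative sketch: check that every token id lies inside the embedding table.
vocab_size = model.config.vocab_size

input_ids = batch['input_ids']
assert input_ids.min().item() >= 0
assert input_ids.max().item() < vocab_size

labels = batch['labels']
valid_labels = labels[labels != -100]   # -100 is the ignore index, not a token id
assert valid_labels.min().item() >= 0
assert valid_labels.max().item() < vocab_size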

An assert on the GPU will corrupt the CUDA context; subsequent CUDA calls are invalid and will return undefined behavior.
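For illustration, a deliberately broken snippet along these lines shows the effect (the out-of-range index here is made up, not taken from your data):

import torch

emb = torch.nn.Embedding(10, 4).to('cuda:0')
bad_idx = torch.tensor([42], device='cuda:0')   # valid indices are 0..9

try:
    emb(bad_idx).sum().item()   # the sync in .item() surfaces the device-side assert
except RuntimeError as err:
    print('first error:', err)

# From this point on the CUDA context is corrupted: unrelated calls may
# return garbage values, raise further errors, or appear to work.
print(torch.arange(1, 16, 1, device='cuda:0'))
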
Could you post a minimal and executable code snippet reproducing the initial cuBLAS error as well as the output of `python -m torch.utils.collect_env`?

Thank you for replying. My environment information from `python -m torch.utils.collect_env` is as follows:

PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.25.0-rc2
Libc version: glibc-2.17

Python version: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-4.18.0-147.mt20200626.413.el8_1.x86_64-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.4.152
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB

Nvidia driver version: 470.82.01
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.8.0
/usr/lib64/libcudnn_adv_infer.so.8.8.0
/usr/lib64/libcudnn_adv_train.so.8.8.0
/usr/lib64/libcudnn_cnn_infer.so.8.8.0
/usr/lib64/libcudnn_cnn_train.so.8.8.0
/usr/lib64/libcudnn_ops_infer.so.8.8.0
/usr/lib64/libcudnn_ops_train.so.8.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] torch==1.13.1+cu117
[pip3] torchaudio==0.13.1
[pip3] torchvision==0.14.1
[conda] mkl                       2023.1.0                 pypi_0    pypi
[conda] mkl-fft                   1.3.1                    pypi_0    pypi
[conda] mkl-random                1.2.2                    pypi_0    pypi
[conda] mkl-service               2.4.0                    pypi_0    pypi
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] torch                     1.13.1+cu117             pypi_0    pypi
[conda] torchaudio                0.13.1                   pypi_0    pypi
[conda] torchvision               0.14.1                   pypi_0    pypi

The code snippet reproducing the error is as follows:

import os
from transformers import BartForConditionalGeneration
import pickle as pkl

# Force synchronous kernel launches so the failing op is reported at its call site.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

model = BartForConditionalGeneration.from_pretrained('./pretrained-models/bart-base').to('cuda:0')

# Batches that trigger the error (shared separately as error_batches.pkl).
batches = pkl.load(open('error_batches.pkl', 'rb'))
for batch in batches:
    model(**batch.to('cuda:0'))

This snippet requires some external data stored in pickle format; I’ve shared it via Google Drive: error_batches.pkl - Google Drive.

Sincerely looking forward to your reply!

Could you create random inputs, e.g. via torch.randn, to reproduce the issue, or does the failure require the pkl file?
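
Something along these lines would do as a sketch (it assumes the local bart-base checkpoint path from your snippet and uses torch.randint rather than torch.randn, since the model expects integer token ids):

import torch
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('./pretrained-models/bart-base').to('cuda:0')

batch_size, seq_len = 8, 128
vocab_size = model.config.vocab_size

# Random token ids in the valid range, a full attention mask, and labels equal to the inputs.
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device='cuda:0')
attention_mask = torch.ones_like(input_ids)
labels = input_ids.clone()

out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
print(out.loss)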