I’ve encountered the same error, but with torch 1.10.2 and CUDA 11.3.
The full error output is below:
Traceback (most recent call last):
  File "train.py", line 349, in <module>
    main_train()
  File "train.py", line 315, in main_train
    train(rank, args, train_dataset, valid_dataset, model, collator, tokenizer)
  File "train.py", line 199, in train
    loss += train_iter(model, batch, optimizer, scheduler, device)
  File "train.py", line 99, in train_iter
    labels=labels, return_dict=False)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1588, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 2370, in forward
    return_dict=return_dict,
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 2238, in forward
    return_dict=return_dict,
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 2102, in forward
    use_cache=use_cache,
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 1024, in forward
    output_attentions=output_attentions,
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/huangbz/.conda/envs/Graph/lib/python3.6/site-packages/transformers/models/led/modeling_led.py", line 819, in forward
    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
terminate called after throwing an instance of 'c10::Error'
what(): NCCL error in: /opt/conda/conda-bld/pytorch_1640811805959/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:181, unhandled cuda error, NCCL version 21.0.3
The output of python -m torch.utils.collect_env:
PyTorch version: 1.10.2
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.9
Python version: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-65-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40
GPU 3: NVIDIA A40
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.10.2
[conda] blas 1.0 mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit 11.3.1 h2bc3f7f_2 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl 2022.0.1 h06a4308_117 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy 1.19.5 <pip>
[conda] pytorch 1.10.2 py3.6_cuda11.3_cudnn8.2.0_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] pytorch-mutex 1.0 cuda https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
I installed torch via conda install pytorch cudatoolkit=11.3
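
In case it helps narrow this down, here is a minimal sketch that runs only the operation from the last frame of the traceback (a half-precision torch.bmm) on a single GPU, outside of DeepSpeed and NCCL. The function name check_fp16_bmm, the tensor shapes, and the device index are placeholders of my own, not values taken from train.py, and the sketch is not confirmed to reproduce the error on this setup.

```python
# Hypothetical isolation check: names, shapes, and device index are assumptions.
try:
    import torch
except ImportError:  # lets the script exit cleanly on a machine without PyTorch
    torch = None


def check_fp16_bmm(batch=8, seq=64, dim=64, device_index=0):
    """Run one fp16 batched matmul on the GPU and return the output shape.

    Returns None when PyTorch or CUDA is unavailable, so nothing is tested.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    device = torch.device("cuda", device_index)
    # Same call pattern as modeling_led.py line 819 in the traceback:
    # torch.bmm(query_states, key_states.transpose(1, 2))
    query_states = torch.randn(batch, seq, dim, device=device, dtype=torch.float16)
    key_states = torch.randn(batch, seq, dim, device=device, dtype=torch.float16)
    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))

    # CUDA kernels run asynchronously; synchronizing makes any error surface here.
    torch.cuda.synchronize(device)
    return tuple(attn_weights.shape)


if __name__ == "__main__":
    print(check_fp16_bmm())
```

If this small case passes, the failure is more likely tied to the real input shapes, GPU memory pressure, or the multi-GPU setup than to the fp16 bmm call on its own.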