Runtime Error: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling 'cublasSgemm'

I ran into a confusing problem when running https://github.com/NVIDIA/tacotron2.git

with:
a single RTX 2080 Ti GPU
PyTorch version 1.2.0
CUDA version 10.0
cuDNN version 7.6.4

The problem happens suddenly after hundreds of training iterations, usually 800~2000 steps.
Sometimes it gives an error message like the one in the title. Sometimes it gives nothing and just seems to hang. When it does give the runtime error, I found it happens when the code performs a matrix multiplication in an nn.Linear layer (input multiplied by weight). The input and weight tensors are both float32 when I debug.
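The kind of check I did while debugging looks roughly like this (the layer and sizes here are just stand-ins, not the exact module from the repo):

```python
import torch
import torch.nn as nn

# Stand-in for the nn.Linear inside the decoder; the sizes are arbitrary,
# not the real Tacotron2 dimensions.
linear = nn.Linear(512, 1024).cuda()
x = torch.randn(32, 512, device="cuda")

# Both the input and the weight report float32, so no FP16 is involved.
print(x.dtype, linear.weight.dtype)    # torch.float32 torch.float32
print(x.device, linear.weight.device)  # cuda:0 cuda:0

out = linear(x)  # this is the kind of matmul that fails after 800~2000 steps
```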

I searched for this problem online and found it looks similar to an FP16 bug that was fixed in CUDA 10.1. But even after updating CUDA to 10.1 and PyTorch to 1.5.0, it still happened. I also checked my code: all the tensors are float32 and there is no FP16 operation.

The same code has also been running on a 1080 Ti GPU with the same CUDA, cuDNN, and PyTorch versions, and everything is fine there, so it seems to happen only when I use a 2080 Ti GPU. Could anyone give some advice on how to avoid this problem?

Could you install the latest PyTorch version (1.5.0) with CUDA 10.2 and cudnn 7.6.5.32, please?
Let me know if you still see this error.
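Once it is installed, you can double-check which versions PyTorch actually sees with something like:

```python
import torch

print(torch.__version__)               # expect 1.5.0
print(torch.version.cuda)              # expect 10.2
print(torch.backends.cudnn.version())  # expect 7605 for cudnn 7.6.5
print(torch.cuda.get_device_name(0))   # should report the RTX 2080 Ti
```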

OK, I'm trying it. When it finishes I'll reply to you.

I have installed PyTorch 1.5.0 with CUDA 10.2 and cudnn 7.6.5.32.
It still gives a runtime error after about 2000 iterations, but this time with a different error message, shown below. This time it seems to happen while a torch.cat operation is running.

RuntimeError: cuda runtime error (719) : unspecified launch failure at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

Traceback (most recent call last):
File "train.py", line 426, in <module>
hparams,
File "train.py", line 284, in train
y_pred_t_list = model(x, teacher=True)
File "/home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/workspace/projects/tacotron2/model.py", line 664, in forward
teacher=teacher,
File "/home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/workspace/projects/tacotron2/model.py", line 520, in forward
) = self.decode(decoder_input)
File "/home/server/workspace/projects/tacotron2/model.py", line 465, in decode
decoder_input = torch.cat((prenet_output, self.attention_context), -1)

It seems that the problem happens randomly. Before I installed PyTorch 1.5.0, it once also gave an error message "corrupted size vs. prev_size" without any other traceback.

Could you monitor the GPU memory usage and check if you might be running out of memory?
Also, could you post the complete error message from the last post?
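For the memory check, you could log the caching allocator statistics once per iteration, e.g. right after the optimizer step (log_gpu_memory is just an illustrative helper, not part of the repo):

```python
import torch

def log_gpu_memory(step):
    # Memory currently held by tensors vs. memory reserved by PyTorch's caching allocator
    # (torch.cuda.memory_reserved in recent releases; memory_cached in older ones).
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"step {step}: allocated {allocated:.0f} MiB, reserved {reserved:.0f} MiB")
```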

I suppose the complete error message is as follows? When the code is running it takes about 9000 MiB while the whole GPU memory is about 11000 MiB, so it might not be running out of memory. I'll still try halving the current batch_size and check the GPU memory again.

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=719 : unspecified launch failure
Traceback (most recent call last):
File "train.py", line 426, in <module>
hparams,
File "train.py", line 284, in train
y_pred_t_list = model(x, teacher=True)
File "/home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/workspace/projects/tacotron2/model.py", line 666, in forward
teacher=teacher,
File "/home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/workspace/projects/tacotron2/model.py", line 522, in forward
) = self.decode(decoder_input)
File "/home/server/workspace/projects/tacotron2/model.py", line 449, in decode
dim=1,
RuntimeError: cuda runtime error (719) : unspecified launch failure at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

The problem still happens even when I use half of the current batch_size…

Could you post the exact command you are using so that I can try to grab the same GPU and try to reproduce this issue?

I'm not sure whether this is a problem with PyTorch, with my GPU or machine, or with my code. However, the same code did run without error on another 1080 Ti machine with the same Linux, Python, PyTorch, CUDA, and cuDNN versions.

For me, I just did:
git clone git@github.com:zwlanpishu/tacotron2.git
git checkout baseline_test
python3 train.py --output_directory "yourpath" --log_directory "yourpath"

The data can be downloaded from https://keithito.com/LJ-Speech-Dataset/
The txt files in the "filelists" directory should be rewritten from "DUMMY/xxx" to "your data path/xxx".

My env:
Ubuntu 18.04
Python 3.6.9
PyTorch 1.5.0
CUDA 10.2
cuDNN 7.6.5.32

The other pip dependencies are listed in requirements.txt; tensorflow is not needed.

Thanks for the update. Is the original (unmodified) Tacotron2 code running on this device?
If so, could you rerun your fork with CUDA_LAUNCH_BLOCKING=1 python train.py --output_directory ... and post the stack trace here again?
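If it is easier, the flag can also be set from inside the script instead of on the command line, as long as it happens before the first CUDA call (a small sketch):

```python
import os

# Setting this before the CUDA context is created (safest: before importing torch)
# makes kernel launches synchronous, so the stack trace points at the op that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402
```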

Yes. I have rerun the original code, and this time the problem happened without giving any stack trace; training just seems to hang after some epochs. I'll try once more to see whether it gives some other error message.

The second time I ran the code, it failed with the information below:

Train loss 5454 0.626859 Grad Norm 0.503917 3.96s/it
Train loss 5455 0.583127 Grad Norm 0.679305 3.14s/it
Train loss 5456 0.622990 Grad Norm 0.589986 4.00s/it
Train loss 5457 0.644452 Grad Norm 0.451847 6.29s/it
terminate called after throwing an instance of 'c10::Error'
what(): get_stat_type_for_pool: invalid pool (get_stat_type_for_pool at /pytorch/c10/cuda/CUDACachingAllocator.cpp:628)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f9cda062536 in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x14214 (0x7f9ce22f7214 in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x14a (0x7f9ce22f995a in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f9cda052abd in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: + 0x2d9973a (0x7f9cb959573a in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: + 0x2d9a954 (0x7f9cb9596954 in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0xf5b (0x7f9cb958276b in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f9cb9583ce2 in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f9cb957c359 in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f9ce2c354d8 in /home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0xbd6df (0x7f9ce3d066df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #11: + 0x76db (0x7f9ce885d6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #12: clone + 0x3f (0x7f9ce8b9688f in /lib/x86_64-linux-gnu/libc.so.6)

[1]+ killed CUDA_LAUNCH_BLOCKING=1 python3 train.py --output_directory /opt/checkpoints/test/ --log_directory log

The third time, it failed with a segmentation fault and no other information. I wonder if there is something wrong with my machine or GPUs.

Which driver version are you using?
Also, was this device working properly before? Could you run some stress tests on the GPU?
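A quick way to stress the card from PyTorch itself (just a sketch; the sizes and iteration count are arbitrary) is to hammer it with large float32 matmuls and check that repeated runs of the same computation stay identical, since a healthy GPU is deterministic here while a faulty one tends to crash or produce differing values:

```python
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
reference = (a @ b).clone()  # result from the first run

for i in range(10000):
    out = a @ b
    # The same sgemm with the same inputs should give identical results every time.
    if not torch.equal(out, reference):
        print(f"iter {i}: result differs from the first run -- the GPU looks unhealthy")
        break
else:
    print("no mismatches detected")
```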

I use the latest driver version, 440.82, for the RTX 2080 Ti. The device worked properly before when training a WaveGlow model and some other models, which is what confuses me… I will try to run some stress tests or run the code on another machine.

Hi, I am not sure whether you have solved the problem, but I can share my experience with this headache of an error.

I ran into the same runtime error suddenly when running my previously bug-free code to train a network on 2 GPUs simultaneously. The error was thrown after several epochs and it even froze my machine completely, yet the training script was exactly the same one that had worked for weeks before. After I re-configured the environment, the problem persisted. Finally, when I tested my script on each GPU card one by one, I realized that one of my GPU cards was broken. So I recommend you run some GPU stress tests to check the GPU itself. I hope this late reply is useful to others hitting the same bug.

BTW, I seriously suspect the quality control of the RTX 2080 Ti :slight_smile:

When a GPU card is broken, it is always an RTX 2080 Ti…

:sweat_smile: Thanks for your reply. I have figured out the problem. It seems that a specific slot for my memory module (RAM) was causing it; when I moved the module to another slot, the problem disappeared.

I’m having this same error with CUDA 10.1 and PyTorch 1.6.0. In particular, I’m using Google Colab. I’ve also checked the data type of the tensors involved and they are all float32. The code works locally on CPU.

One thing that’s different between my error and what everyone else appears to be experiencing is that mine appears right at the start of training. The bug still happens during a forward computation on an nn.Linear layer as the original poster described.

Because I am on Colab, I cannot fix or replace the GPU directly. What other options do I have?