With:
a single RTX 2080 Ti GPU
PyTorch version 1.2.0
CUDA version 10.0
cuDNN version 7.6.4
The problem appears suddenly after several hundred training iterations, usually between 800 and 2000 steps.
Sometimes it gives an error message like the one in the title. Sometimes it prints nothing and just seems to hang. When it does give the runtime error, I found that it occurs during the matrix multiplication in an nn.Linear layer (input multiplied by weight). The input and weight tensors were all float32 when I debugged.
I searched for this problem online and found that it resembles a bug in FP16 operations that was fixed in CUDA 10.1. But after I updated CUDA to 10.1 and PyTorch to 1.5.0, it still happened. I also checked my code: all the tensors are float32 and there are no FP16 operations.
The same code has also been running on a 1080 Ti GPU with the same CUDA, cuDNN, and PyTorch versions, and everything is fine there. It seems to happen only when I use the 2080 Ti. Could anyone give some advice on how to avoid this problem?
I have now installed PyTorch 1.5.0 with CUDA 10.2 and cuDNN 7.6.5.32.
It still gives a runtime error after about 2000 iterations, but this time with a different error message, shown below. This time it seems to happen during a torch.cat operation.
RuntimeError: cuda runtime error (719) : unspecified launch failure at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
Traceback (most recent call last):
  File "train.py", line 426, in <module>
    hparams,
  File "train.py", line 284, in train
    y_pred_t_list = model(x, teacher=True)
  File "/home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/server/workspace/projects/tacotron2/model.py", line 664, in forward
    teacher=teacher,
  File "/home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/server/workspace/projects/tacotron2/model.py", line 520, in forward
    ) = self.decode(decoder_input)
  File "/home/server/workspace/projects/tacotron2/model.py", line 465, in decode
    decoder_input = torch.cat((prenet_output, self.attention_context), -1)
The problem seems to happen randomly. Before I installed PyTorch 1.5.0, it once also gave the error message "corrupted size vs. prev_size" without any traceback.
Could you monitor the GPU memory usage and check whether you might be running out of memory?
Also, could you post the complete error message from the last post?
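For anyone who wants to script the memory check suggested above, here is a minimal sketch that polls nvidia-smi from Python. The helper names parse_memory_csv and query_gpu_memory are made up for illustration, and it assumes nvidia-smi is on the PATH.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' lines (in MiB) as printed by nvidia-smi
    with --format=csv,noheader,nounits. Returns one tuple per GPU."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Call nvidia-smi once and return [(used_mib, total_mib), ...]."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,  # works on Python 3.6 as well
    )
    return parse_memory_csv(output)
```

Calling query_gpu_memory() every few hundred steps and logging the result would show whether usage creeps toward the card's total before the failure. Inside the training process, torch.cuda.max_memory_allocated() reports the peak that PyTorch itself has allocated.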
I suppose the complete error message is as follows? While the code is running, it takes about 9000 MB of the roughly 11000 MB of GPU memory, so it is probably not running out of memory. I'll still try halving the current batch_size and check GPU memory again.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=719 : unspecified launch failure
Traceback (most recent call last):
  File "train.py", line 426, in <module>
    hparams,
  File "train.py", line 284, in train
    y_pred_t_list = model(x, teacher=True)
  File "/home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/server/workspace/projects/tacotron2/model.py", line 666, in forward
    teacher=teacher,
  File "/home/server/workspace/projects/tacotron2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/server/workspace/projects/tacotron2/model.py", line 522, in forward
    ) = self.decode(decoder_input)
  File "/home/server/workspace/projects/tacotron2/model.py", line 449, in decode
    dim=1,
RuntimeError: cuda runtime error (719) : unspecified launch failure at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
I'm not sure whether this is a problem with PyTorch, or with my GPU, machine, or code. However, the same code did run without error on another machine with a 1080 Ti and the same Linux, Python, PyTorch, CUDA, and cuDNN versions.
To reproduce, I just ran:
git clone git@github.com:zwlanpishu/tacotron2.git
cd tacotron2
git checkout baseline_test
python3 train.py --output_directory "yourpath" --log_directory "yourpath"
The data can be downloaded from https://keithito.com/LJ-Speech-Dataset/
The txt files in the "filelists" directory should be rewritten from "DUMMY/xxx" to "your data path/xxx".
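The filelist rewrite can be scripted. This is only a sketch: the function names are hypothetical, and it assumes each line of a filelist starts with the DUMMY/ placeholder followed by the wav file name (anything after that on the line is left untouched).

```python
from pathlib import Path


def rewrite_filelist_text(text, data_dir, placeholder="DUMMY"):
    """Replace a leading 'DUMMY/' on every line with 'data_dir/'."""
    old_prefix = placeholder + "/"
    new_prefix = data_dir.rstrip("/") + "/"
    lines = []
    for line in text.splitlines():
        if line.startswith(old_prefix):
            line = new_prefix + line[len(old_prefix):]
        lines.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(lines) + ("\n" if text.endswith("\n") else "")


def rewrite_filelists(filelist_dir, data_dir):
    """Rewrite every .txt file in filelist_dir in place."""
    for path in sorted(Path(filelist_dir).glob("*.txt")):
        original = path.read_text(encoding="utf-8")
        path.write_text(rewrite_filelist_text(original, data_dir), encoding="utf-8")
```

For example, rewrite_filelists("filelists", "/your/data/path") would update all filelists at once; it overwrites the files, so keep a copy or rely on git to restore them.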
my env:
Ubuntu 18.04
python 3.6.9
pytorch 1.5.0
cuda 10.2
cudnn 7.6.5.32
The other pip dependencies are listed in requirements.txt; tensorflow is not needed.
Thanks for the update. Is the original (unmodified) Tacotron2 code running on this device?
If so, could you rerun your fork with CUDA_LAUNCH_BLOCKING=1 python train.py --output_directory ... and post the stack trace here again?
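Some background on why this helps: CUDA kernels are launched asynchronously, so the Python stack trace can point at a later operation than the one that actually failed. That may be why the trace points at different operations from run to run. With CUDA_LAUNCH_BLOCKING=1 each launch is synchronous and the trace should point at the failing call. If setting the variable on the command line is inconvenient, a small launcher along these lines would do the same thing (the function names are made up for this sketch):

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training_blocking(script_args):
    """Run `python <script_args...>` in a child process with
    CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    command = [sys.executable] + list(script_args)
    return subprocess.call(command, env=blocking_env())
```

For example, run_training_blocking(["train.py", "--output_directory", "yourpath", "--log_directory", "yourpath"]). The variable has to be set before CUDA is initialized, which is why the sketch sets it for a fresh child process rather than in the middle of a running script. Expect training to be slower while it is enabled.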
Yes. I have rerun the original code, and this time the problem happens without any stack trace. Training just seems to hang after some epochs. I'll try once more to see whether it gives some other error message.
I use the latest driver version, 440.82, for the RTX 2080 Ti. It worked properly before when training a WaveGlow model and some other models. It does confuse me… I will try to run some stress tests or run the code on another device.
Hi, I am not sure whether you have solved the problem, but I can share my experience with this headache of an error.
I suddenly hit the same runtime error when running my previously bug-free code to train a network on 2 GPUs simultaneously. The error was thrown after several epochs, and it even made my machine freeze completely. However, the training script was exactly the same and had worked for some weeks before. After I reconfigured the environment, the problem still existed. Finally, when I tested my script on each GPU card one at a time, I realized that one of my GPU cards was broken. So I recommend running a GPU stress test to check the GPU itself. I hope this late reply is useful to others suffering from the same bug.
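Testing the cards one at a time can be done without touching the training script by restricting which GPU each run can see through CUDA_VISIBLE_DEVICES. A rough sketch, with hypothetical helper names and assuming the GPUs are numbered 0, 1, ... as nvidia-smi lists them:

```python
import os
import subprocess


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment in which only one GPU is visible."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_each_gpu(command, gpu_indices):
    """Run the same command once per GPU, one after another,
    and return {gpu_index: exit_code}."""
    results = {}
    for gpu_index in gpu_indices:
        results[gpu_index] = subprocess.call(command, env=single_gpu_env(gpu_index))
    return results
```

For example, run_on_each_gpu(["python3", "train.py"], [0, 1]). A card that fails or hangs on its own while the other completes is a strong hint of a hardware problem, though a hang will block the loop, so it may be worth watching the runs rather than leaving them unattended.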
Thanks for your reply. I have figured out the problem. It seems that a specific slot for my memory module was causing it. When I switched to other slots, the problem disappeared.
I’m having this same error with CUDA 10.1 and PyTorch 1.6.0. In particular, I’m using Google Colab. I’ve also checked the data type of the tensors involved and they are all float32. The code works locally on CPU.
One thing that’s different between my error and what everyone else appears to be experiencing is that mine appears right at the start of training. The bug still happens during a forward computation on an nn.Linear layer as the original poster described.
Because I am on Colab, I cannot fix the GPU directly/replace the GPU – what other options do I have?