Error when training network for larger amount of time - CUDA error: unspecified launch failure

sid-ls · May 15, 2020, 5:05am

I’m getting a CUDA error: unspecified launch failure when training for longer periods of time. Training for a few hours (20 epochs) worked fine, but when I train for more than 40 epochs my code terminates at a random point with CUDA error: unspecified launch failure.

I’m using pin_memory=False in my DataLoader, if that matters, and while training one epoch my GPU memory usage is about 2 GB out of 8 GB. Are there any GPU debugging tricks I can use to find out where the problem is?

The fact that I don’t get the error at first, but only later, makes me think it’s related to how the GPU allocates resources.

Traceback (most recent call last):
  File "c:/Users/C/Desktop/projects/M5/train.py", line 51, in <module>
    scheduler=scheduler, norm_factors=norm_vec, norm_idx=norm_idx)
  File "c:\Users\C\Desktop\projects\M5\seq2seq_forecasting\seq2seq_forecasting\core.py", line 413, in train
    outputs = model(x, target.clone())
  File "C:\Users\C\AppData\Local\Programs\Miniconda3\envs\torch\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "c:\Users\C\Desktop\projects\M5\seq2seq_forecasting\seq2seq_forecasting\core.py", line 245, in forward
    outp, hiddens = self.decode_n(dec_inp,hiddens)
  File "c:\Users\C\Desktop\projects\M5\seq2seq_forecasting\seq2seq_forecasting\core.py", line 231, in decode_n
    outp = self.out_act(self.fc(o))
  File "C:\Users\C\AppData\Local\Programs\Miniconda3\envs\torch\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\C\AppData\Local\Programs\Miniconda3\envs\torch\lib\site-packages\torch\nn\modules\linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\C\AppData\Local\Programs\Miniconda3\envs\torch\lib\site-packages\torch\nn\functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: unspecified launch failure

ptrblck · May 15, 2020, 5:51am

Could you rerun the code with CUDA_LAUNCH_BLOCKING=1 python script.py args and post the stack trace again, please?
Since CUDA operations are asynchronous, the current stack trace might point to a wrong location.

Was this device working fine before and did you update anything in your setup?

Also, you could try to use anomaly detection to check whether something in your model goes wrong, although I wouldn’t expect a launch failure in this case.
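For reference, anomaly detection can be enabled around a single training step roughly as below. This is only a sketch: `step_with_anomaly_detection` and its arguments are hypothetical names, and the model is called with one input here, whereas the code in the traceback calls `model(x, target.clone())`. Anomaly detection slows training noticeably, so it is usually switched on only while debugging.

```python
import torch


def step_with_anomaly_detection(model, x, target, criterion):
    """Run one forward/backward pass with autograd anomaly detection enabled.

    Hypothetical helper: adapt the model call to your own forward signature.
    """
    # Inside this context, a backward function that produces NaN raises an
    # error, and the traceback of the forward op that created it is printed.
    with torch.autograd.detect_anomaly():
        outputs = model(x)
        loss = criterion(outputs, target)
        loss.backward()
    return loss.item()
```

The same check can be switched on globally with `torch.autograd.set_detect_anomaly(True)` at the start of the script.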

sid-ls · May 15, 2020, 6:15am

Oh okay, that makes sense. The machine I’m currently on runs Windows, so I’m setting CUDA_LAUNCH_BLOCKING=1 in the system settings.

I’ll try that, thank you. I’m also logging GPU memory to see whether OOM is the problem.
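A minimal sketch of how GPU memory could be logged from inside the training loop, assuming a PyTorch version that provides `torch.cuda.memory_reserved` (older releases call it `memory_cached`). The helper name `gpu_memory_stats` and the MiB units are illustrative choices, not something from the original code.

```python
import torch


def gpu_memory_stats(device=0):
    """Return PyTorch's view of GPU memory in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        # Memory currently occupied by live tensors.
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        # Memory held by the caching allocator (allocated + cached blocks).
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
        # Peak tensor memory since the start of the program (or last reset).
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
    }
```

Printing `gpu_memory_stats()` once per epoch would show whether usage creeps upward over time. These numbers only cover memory managed by PyTorch, so they can differ from what nvidia-smi reports.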

sid-ls · May 15, 2020, 7:07pm

I did os.environ['CUDA_LAUNCH_BLOCKING'] = '1' in the script at the very beginning.

  File "c:/Users/C/Desktop/projects/M5/train.py", line 54, in <module>
    model = core.train(model, criterion, optimizer, train_dataloader, valid_dataloader, nb_epochs=epochs,
  File "c:\Users\C\Desktop\projects\M5\seq2seq_forecasting\seq2seq_forecasting\core.py", line 479, in train
    outputs = model(x, target.clone())
  File "C:\Users\C\AppData\Local\Programs\Miniconda3\envs\torch\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "c:\Users\C\Desktop\projects\M5\seq2seq_forecasting\seq2seq_forecasting\core.py", line 245, in forward
    outp, hiddens = self.decode_n(dec_inp,hiddens)
  File "c:\Users\C\Desktop\projects\M5\seq2seq_forecasting\seq2seq_forecasting\core.py", line 231, in decode_n
    outp = self.out_act(self.fc(o))
  File "C:\Users\C\AppData\Local\Programs\Miniconda3\envs\torch\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\C\AppData\Local\Programs\Miniconda3\envs\torch\lib\site-packages\torch\nn\modules\linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\C\AppData\Local\Programs\Miniconda3\envs\torch\lib\site-packages\torch\nn\functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: unspecified launch failure

The trace looks pretty much the same. I’m guessing setting the environment variable that way didn’t work?
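One thing worth checking, as far as I know: CUDA_LAUNCH_BLOCKING is read when CUDA is initialized, so it has to be in the environment before the first CUDA call in the process. Setting it before `import torch` is the simplest way to be sure. The sketch below uses a hypothetical helper, `enable_blocking_launches`, that sets the variable and warns if torch was already imported. In a Jupyter notebook this would probably mean restarting the kernel and running this cell first.

```python
import os
import sys
import warnings


def enable_blocking_launches():
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns True if torch had not been imported yet, which is the safe case.
    Hypothetical helper for illustration only.
    """
    torch_already_imported = "torch" in sys.modules
    if torch_already_imported:
        warnings.warn(
            "torch is already imported; if CUDA was already initialized, "
            "CUDA_LAUNCH_BLOCKING may not take effect in this process."
        )
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_imported


enable_blocking_launches()
# Import torch (and anything that imports torch) only after this point.
```

Note that an identical trace does not by itself prove the variable was ignored: if the failing kernel really is the one launched by the linear layer, the blocking run would report the same location.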

ptrblck · May 16, 2020, 5:29am

Thanks for the update. I’m not completely sure how to use this env var on Windows, but could you try to create a code snippet that reproduces this error, so that we can have a look, please?
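A reproduction script usually looks something like the sketch below: fixed seed, random inputs, and the smallest model that still fails. Everything here is a placeholder, since the real seq2seq model is not shown in the thread. `TinyDecoder`, `run_repro`, and all shapes are made up and would need to be replaced with the actual layers and tensor sizes.

```python
import torch
from torch import nn


class TinyDecoder(nn.Module):
    """Placeholder model: a GRU followed by a linear output layer."""

    def __init__(self, n_features=8, hidden=16):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):
        o, _ = self.rnn(x)
        return self.fc(o)


def run_repro(steps=100, device=None, seed=0):
    """Train on random data for a number of steps and return the last loss."""
    torch.manual_seed(seed)
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyDecoder().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.MSELoss()
    loss = None
    for _ in range(steps):
        # Random inputs stand in for the real dataset (shapes are made up).
        x = torch.randn(32, 28, 8, device=device)
        target = torch.randn(32, 28, 1, device=device)
        optimizer.zero_grad()
        loss = criterion(model(x), target)
        loss.backward()
        optimizer.step()
    return loss.item()
```

Running `run_repro(steps=...)` with a large step count on the GPU would show whether the failure depends on the data pipeline or occurs with random inputs as well.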

sid-ls · May 18, 2020, 12:39am

Actually, I haven’t seen that error since I stopped training in Jupyter notebooks. I’m not sure whether there’s a direct link, but I don’t get this error when I train through Python scripts.

Let me look into it. If I encounter it again, I’ll create a code snippet to reproduce it.