CUDA error: device-side assert triggered in train_loss_set.append(loss.item())

Hi everyone! I am getting this error when trying to train a BERT model. It gets through about 60 examples and then this error pops up. I made sure my labels are fine; there are no negative labels.

RuntimeError                              Traceback (most recent call last)
<ipython-input-34-87f0e3931c4a> in <module>
     24 
     25         loss = outputs[0]
---> 26         train_loss_set.append(loss.item())
     27         loss_train_total += loss.item()
     28         loss.backward()

RuntimeError: CUDA error: device-side assert triggered

You would want to run with blocking kernel launches to find the actual operation that causes this.
Then you can try to look at the batch in more detail and/or check the source code for the conditions under which it asserts.
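
For reference, a minimal way to enable blocking launches (the environment variable has to be set before the first CUDA call, so restart the kernel/process and put this at the very top of the script or notebook):

```python
import os

# With blocking launches, kernels run synchronously, so the failing operation
# raises the error at its actual Python call site instead of a later line.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Alternatively, you can set CUDA_LAUNCH_BLOCKING=1 in the shell before launching Python.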

Best regards

Thomas

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-34-87f0e3931c4a> in <module>
     20 
     21 
---> 22         outputs = model(**inputs)
     23 
     24 

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

C:\ProgramData\Anaconda3\lib\site-packages\transformers\modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states)
   1282             else:
   1283                 loss_fct = CrossEntropyLoss()
-> 1284                 loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
   1285             outputs = (loss,) + outputs
   1286 

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
    959 
    960     def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 961         return F.cross_entropy(input, target, weight=self.weight,
    962                                ignore_index=self.ignore_index, reduction=self.reduction)
    963 

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   2466     if size_average is not None or reduce is not None:
   2467         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2468     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   2469 
   2470 

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2262                          .format(input.size(0), target.size(0)))
   2263     if dim == 2:
-> 2264         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   2265     elif dim == 4:
   2266         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

RuntimeError: cuda runtime error (710) : device-side assert triggered at C:/cb/pytorch_1000000000000/work/aten/src\THCUNN/generic/ClassNLLCriterion.cu:115

This is usually a label problem. Can you do the following:

  • save logits.shape and labels.cpu() in temporary variables,
  • after the device assert is triggered, print these (see the sketch below).
    If you have a label that is >= num_labels or < 0, you have found the problem.
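
A minimal sketch of the idea, assuming a standard training loop (the loop and variable names here are illustrative, not your exact code):

```python
debug = {}
for step, batch in enumerate(train_dataloader):
    inputs = {k: v.to(device) for k, v in batch.items()}
    # Copy the labels to the CPU *before* the forward pass, so they are still
    # readable after a device-side assert poisons the CUDA context.
    debug["labels"] = inputs["labels"].cpu()

    outputs = model(**inputs)                  # (loss, logits, ...) in this transformers version
    debug["logits_shape"] = outputs[1].shape   # e.g. torch.Size([batch_size, num_labels])

    loss = outputs[0]
    # ... rest of the training step ...
```

After the assert fires, print debug["labels"] and debug["logits_shape"] and check that every label lies in [0, num_labels - 1].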

I did what you said and I got this:


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-38-50cfb61e001e> in <module>
      2 lg_shape = logits.shape
      3 labels = inputs['labels']
----> 4 lb_cpu = labels.cpu()

RuntimeError: CUDA error: device-side assert triggered

and this for logits.shape:

torch.Size([3, 2])

I think you want blocking launches here as well. Then, after the runtime error (if you're in Jupyter, or if you launch python -i yourscript.py), you can print the labels to see if any are < 0 or > 1 (the logits shape torch.Size([3, 2]) means num_labels is 2, so the only valid labels are 0 and 1).
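
Independently of the CUDA debugging, you can also check the label range entirely on the CPU before training; a small sketch (the dataloader and field names are assumptions about your setup):

```python
import torch

# Gather all labels on the CPU and check their range; every label must lie in
# [0, num_labels - 1] for CrossEntropyLoss, i.e. 0 or 1 for a 2-class head.
all_labels = torch.cat([batch["labels"].cpu() for batch in train_dataloader])
print(all_labels.min().item(), all_labels.max().item())
```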

I enabled the blocking launches before trying to train the model. I got the long error, then tried to print the labels, but got this:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-38-e847ac6dc72c> in <module>
----> 1 print(inputs['labels'])

C:\ProgramData\Anaconda3\lib\site-packages\torch\tensor.py in __repr__(self)
    177             return handle_torch_function(Tensor.__repr__, relevant_args, self)
    178         # All strings are unicode in Python 3.
--> 179         return torch._tensor_str._str(self)
    180 
    181     def backward(self, gradient=None, retain_graph=None, create_graph=False):

C:\ProgramData\Anaconda3\lib\site-packages\torch\_tensor_str.py in _str(self)
    370 def _str(self):
    371     with torch.no_grad():
--> 372         return _str_intern(self)

C:\ProgramData\Anaconda3\lib\site-packages\torch\_tensor_str.py in _str_intern(self)
    350                     tensor_str = _tensor_str(self.to_dense(), indent)
    351                 else:
--> 352                     tensor_str = _tensor_str(self, indent)
    353 
    354     if self.layout != torch.strided:

C:\ProgramData\Anaconda3\lib\site-packages\torch\_tensor_str.py in _tensor_str(self, indent)
    239         return _tensor_str_with_formatter(self, indent, summarize, real_formatter, imag_formatter)
    240     else:
--> 241         formatter = _Formatter(get_summarized_data(self) if summarize else self)
    242         return _tensor_str_with_formatter(self, indent, summarize, formatter)
    243 

C:\ProgramData\Anaconda3\lib\site-packages\torch\_tensor_str.py in __init__(self, tensor)
     83         if not self.floating_dtype:
     84             for value in tensor_view:
---> 85                 value_str = '{}'.format(value)
     86                 self.max_width = max(self.max_width, len(value_str))
     87 

C:\ProgramData\Anaconda3\lib\site-packages\torch\tensor.py in __format__(self, format_spec)
    532             return handle_torch_function(Tensor.__format__, relevant_args, self, format_spec)
    533         if self.dim() == 0:
--> 534             return self.item().__format__(format_spec)
    535         return object.__format__(self, format_spec)
    536 

RuntimeError: CUDA error: device-side assert triggered

I managed to solve the issue by passing the num_labels parameter when initializing the model. I had somehow forgotten to do that; the default is 2, I believe, and I had more classes than that. Thanks a lot!
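
For anyone landing here with the same error, the fix looks roughly like this (the checkpoint name and class count are placeholders, not my exact values):

```python
from transformers import BertForSequenceClassification

# Pass the actual number of classes; the default head has num_labels=2, so any
# label >= 2 trips the device-side assert inside CrossEntropyLoss.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",   # placeholder checkpoint
    num_labels=4,          # replace with your real number of classes
)
```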