Device side assert triggered while printing loss

Hello all,
I know the error in the title has already been discussed at length on this forum, but I cannot work out why I am hitting it. I get "device-side assert triggered" when I print my loss. I am training a segmentation network on the PASCAL VOC dataset, and my training loop is as follows -

for i in range(100):
    epoch_loss = 0
    num_nan = 0
    for _, data in enumerate(dataloader):
        image = data['image'].cuda()
        mask = data['ground_truth'].cuda()
        with autocast():
            loss = model((image, mask))
            print(loss)
        
        scaler.scale(loss).backward()
        scaler.unscale_(optim)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        scaler.step(optim)
        scaler.update()
        epoch_loss += loss.item()
        del loss
        torch.cuda.empty_cache()
        
        
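    # Average over the number of batches; the loop index is one less than that (and 0 for a single batch)
    print(f'Epoch Loss = {epoch_loss / len(dataloader)}, Number of Nans = {num_nan}')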
    #scheduler.step()

and the stack trace is as follows -

RuntimeError                              Traceback (most recent call last)
<ipython-input-1-686f5bbeb585> in <module>
   1138         with autocast():
   1139             loss = model((image, mask))
-> 1140             print(loss)
   1141 
   1142         scaler.scale(loss).backward()

/opt/conda/lib/python3.7/site-packages/torch/tensor.py in __repr__(self)
    177             return handle_torch_function(Tensor.__repr__, relevant_args, self)
    178         # All strings are unicode in Python 3.
--> 179         return torch._tensor_str._str(self)
    180 
    181     def backward(self, gradient=None, retain_graph=None, create_graph=False):

/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py in _str(self)
    370 def _str(self):
    371     with torch.no_grad():
--> 372         return _str_intern(self)

/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py in _str_intern(self)
    350                     tensor_str = _tensor_str(self.to_dense(), indent)
    351                 else:
--> 352                     tensor_str = _tensor_str(self, indent)
    353 
    354     if self.layout != torch.strided:

/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py in _tensor_str(self, indent)
    239         return _tensor_str_with_formatter(self, indent, summarize, real_formatter, imag_formatter)
    240     else:
--> 241         formatter = _Formatter(get_summarized_data(self) if summarize else self)
    242         return _tensor_str_with_formatter(self, indent, summarize, formatter)
    243 

/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py in __init__(self, tensor)
     87 
     88         else:
---> 89             nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
     90 
     91             if nonzero_finite_vals.numel() == 0:

RuntimeError: CUDA error: device-side assert triggered

As can be seen, the error is raised in the print call. I monitor GPU memory, and memory is not the issue, as I clear the cache at the end of every iteration.

If I do not print the loss, the error is raised in the backward call instead -

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-000e70ad2e12> in <module>
   1139             loss = model((image, mask))
   1140 
-> 1141         scaler.scale(loss).backward()
   1142         scaler.unscale_(optim)
   1143         torch.nn.utils.clip_grad_norm_(model.parameters(), 1)

/opt/conda/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    130     Variable._execution_engine.run_backward(
    131         tensors, grad_tensors_, retain_graph, create_graph,
--> 132         allow_unreachable=True)  # allow_unreachable flag
    133 
    134 

RuntimeError: CUDA error: device-side assert triggered

What should I do to debug this and get training to run smoothly?
TIA

I launched the code with CUDA_LAUNCH_BLOCKING=1 and the stack trace is as follows -

<ipython-input-1-0a71a9a41b52> in <module>
   1137         print(mask.shape)
   1138         with autocast():
-> 1139             loss = model((image, mask))
   1140             print(loss)
   1141 

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

<ipython-input-1-0a71a9a41b52> in forward(self, input)
    553             out = self.segaHead(out)
    554 
--> 555             focal = self.focalLoss(out, mask)
    556             dice = dice_loss(mask, out)
    557 

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

<ipython-input-1-0a71a9a41b52> in forward(self, input, target)
     64 
     65         # compute the negative likelyhood
---> 66         logpt = -F.cross_entropy(input, target)
     67         pt = torch.exp(logpt)
     68 

/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   2466     if size_average is not None or reduce is not None:
   2467         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2468     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   2469 
   2470 

/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2262                          .format(input.size(0), target.size(0)))
   2263     if dim == 2:
-> 2264         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   2265     elif dim == 4:
   2266         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

RuntimeError: cuda runtime error (710) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1603729138878/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:115

and my focal loss definition is -


class FocalLoss2d(nn.Module):

    def __init__(self, gamma=0, weight=None, size_average=True):
        super(FocalLoss2d, self).__init__()

        self.gamma = gamma
        self.weight = weight
        self.size_average = size_average

    def forward(self, input, target):
        if input.dim()>2:
            input = input.contiguous().view(input.size(0), input.size(1), -1)
            input = input.transpose(1,2)
            input = input.contiguous().view(-1, input.size(2)).squeeze()
        if target.dim()==4:
            target = target.contiguous().view(target.size(0), target.size(1), -1)
            target = target.transpose(1,2)
            target = target.contiguous().view(-1, target.size(2)).squeeze()
        elif target.dim()==3:
            target = target.view(-1)
        else:
            target = target.view(-1, 1)

        # compute the negative likelihood
        logpt = -F.cross_entropy(input, target)
        pt = torch.exp(logpt)

        # compute the loss
        loss = -((1-pt)**self.gamma) * logpt

        # averaging (or not) loss
        if self.size_average:
            return loss.mean()
        else:
            return loss.sum()

What exactly is throwing the error?
I added this as a comment so that the original post would not get too long.

Most likely your targets contain invalid values outside the expected range [0, nb_classes-1].
I'm not sure whether you are using an older PyTorch version, since the stack trace with blocking launches should normally show the indexing error.
If you are already using the latest release, you could set TORCH_SHOW_CPP_STACKTRACES=1 to hopefully get a more detailed stack trace, or run the code on the CPU to get a clearer error message.
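
In a notebook, one way to set these variables is from Python, before anything touches the GPU. This is only a minimal sketch: it assumes a freshly restarted kernel, and setting the variables before importing torch is a cautious choice rather than a documented requirement for both of them.

```python
import os

# Both variable names come from this thread. Set them before the first CUDA
# call; doing it before `import torch` in a fresh kernel is the cautious option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"        # report CUDA errors at the failing call
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"  # ask for C++ frames in the error message

# Confirm what the current process will see.
for name in ("CUDA_LAUNCH_BLOCKING", "TORCH_SHOW_CPP_STACKTRACES"):
    print(name, "=", os.environ[name])
```

If the notebook has already run CUDA code, restarting the kernel first is the safer route, since a device-side assert usually leaves the CUDA context unusable anyway.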

Thank you for replying.
I will try it out.

PS - That was indeed the problem: I had counted the number of classes wrong :sweat_smile:
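
For anyone hitting the same assert, a cheap pre-flight check on the label values can catch this before the loss runs on the GPU. The sketch below is an illustration, not code from this thread: find_invalid_labels is a hypothetical helper, the observed values are made up, and the use of 255 as a "void" label is an assumption about how PASCAL VOC masks are commonly encoded.

```python
def find_invalid_labels(labels, num_classes, ignore_index=None):
    """Return the sorted label values that fall outside [0, num_classes - 1]."""
    invalid = set()
    for value in labels:
        value = int(value)
        if ignore_index is not None and value == ignore_index:
            continue  # the loss skips this value, so it is allowed
        if not 0 <= value < num_classes:
            invalid.add(value)
    return sorted(invalid)


# Hypothetical label values, e.g. what torch.unique(mask).tolist() might return.
observed = [0, 1, 15, 20, 255]

print(find_invalid_labels(observed, num_classes=20))                    # [20, 255]
print(find_invalid_labels(observed, num_classes=21))                    # [255]
print(find_invalid_labels(observed, num_classes=21, ignore_index=255))  # []
```

In the training loop above, this could be called on torch.unique(mask).tolist() before the forward pass. If a void label such as 255 is present, passing ignore_index=255 to F.cross_entropy is one way to exclude it; otherwise the number of output channels has to cover the largest label.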