CUDA error: an illegal memory access while training a deep learning model

I am trying to train a deep learning model on a custom dataset for semantic segmentation.

When I try to train on my PC, I get the following error.

 File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 185, in main
    train(config, epoch, config.TRAIN.END_EPOCH, 
  File "/home/deshpand/thesis_rr/semantic_segmentation_network/PIDNet/tools/../utils/function.py", line 43, in train
    losses, _, acc, loss_list = model(images, labels, bd_gts)
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/thesis_rr/semantic_segmentation_network/PIDNet/tools/../utils/utils.py", line 48, in forward
    loss_s = self.sem_loss(outputs[:-1], labels)
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/thesis_rr/semantic_segmentation_network/PIDNet/tools/../utils/criterion.py", line 90, in forward
    return sum([
  File "/home/deshpand/thesis_rr/semantic_segmentation_network/PIDNet/tools/../utils/criterion.py", line 91, in <listcomp>
    w * func(x, target)
  File "/home/deshpand/thesis_rr/semantic_segmentation_network/PIDNet/tools/../utils/criterion.py", line 72, in _ohem_forward
    pred, ind = pred.contiguous().view(-1,)[mask].contiguous().sort()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

This is the first time I have seen an error like this. Can someone please explain what is going on here?

The specifications for my GPU are as follows.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 25%   36C    P0    29W / 120W |    648MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       902      G   /usr/lib/xorg/Xorg                245MiB |
|    0   N/A  N/A      1234      G   /usr/bin/kwin_x11                 123MiB |
|    0   N/A  N/A      1289      G   /usr/bin/plasmashell               48MiB |
|    0   N/A  N/A      1481      G   /usr/lib/firefox/firefox          173MiB |
|    0   N/A  N/A      5801      G   ...RendererForSitePerProcess       49MiB |
+-----------------------------------------------------------------------------+

I would really appreciate any help here. Also, I am using PyTorch version 1.13.1:

>>> print(torch.__version__)
1.13.1
>>> 

The graphics card is a GeForce GTX 1080Ti (6GB model).

Could you update to the latest stable or nightly release and check if you are still seeing the same error, please?
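If you are on CUDA 11.7, installing a nightly should work with something along the lines of pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu117 (adjust the index URL to your environment).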

I tried with the nightly build and am getting the same error.

The torch build that I used:

>>> import torch
>>> print(torch.__version__)
2.1.0.dev20230405+cu117
>>> 

Error image

Thank you for checking! Could you post a minimal and executable code snippet to reproduce the issue, please?

I am posting the image and the code snippet for the line where this error is encountered. In the image below, pred.contiguous() is the line where the error occurs. I have tried replacing this call with torch.as_strided(), but that did not work out. I am now trying to either rewrite the loss calculation completely or use a different PyTorch function to define the same loss.

import torch
import torch.nn as nn
import torch.nn.functional as F


class OhemCrossEntropy(nn.Module):
    def __init__(self, ignore_label=-1, thres=0.7,
                 min_kept=100000, weight=None):
        super(OhemCrossEntropy, self).__init__()
        self.thresh = thres
        self.min_kept = max(1, min_kept)
        self.ignore_label = ignore_label
        self.criterion = nn.CrossEntropyLoss(
            weight=weight,
            ignore_index=ignore_label,
            reduction='none'
        )

    def _ce_forward(self, score, target):
        loss = self.criterion(score, target)
        return loss

    def _ohem_forward(self, score, target, **kwargs):
        # per-pixel class probabilities and per-pixel cross-entropy losses
        pred = F.softmax(score, dim=1)
        print(type(pred))
        print('size of pred: ', pred.numel())
        pixel_losses = self.criterion(score, target).contiguous().view(-1)
        # mask of pixels whose label is not ignore_label
        mask = target.contiguous().view(-1) != self.ignore_label

        tmp_target = target.clone()
        tmp_target[tmp_target == self.ignore_label] = 0
        # probability the model assigns to the ground-truth class of each pixel
        pred = pred.gather(1, tmp_target.unsqueeze(1))
        #pred, ind = torch.as_strided(pred, (1, pred.numel()), (0, pred.numel())).view(-1)[mask]  # Replacing contiguous() with as_strided() does not work because there is not enough memory to store this.
        print('program executing until this point.')
        pred, ind = pred.contiguous().view(-1,)[mask].contiguous().sort()
        min_value = pred[min(self.min_kept, pred.numel() - 1)]
        threshold = max(min_value, self.thresh)

        pixel_losses = pixel_losses[mask][ind]
        pixel_losses = pixel_losses[pred < threshold]
        return pixel_losses.mean()
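
For reference, this is roughly the kind of rewrite I have in mind: instead of sorting the softmax scores, keep the min_kept hardest pixels directly with torch.topk on the per-pixel losses. This is only a sketch and I have not verified that it selects exactly the same pixels as the OHEM code above (it drops the probability threshold).

import torch
import torch.nn as nn

class TopKCrossEntropy(nn.Module):
    def __init__(self, ignore_label=-1, min_kept=100000, weight=None):
        super().__init__()
        self.min_kept = max(1, min_kept)
        self.ignore_label = ignore_label
        self.criterion = nn.CrossEntropyLoss(
            weight=weight, ignore_index=ignore_label, reduction='none')

    def forward(self, score, target):
        # per-pixel losses, flattened, with ignored pixels dropped
        pixel_losses = self.criterion(score, target).reshape(-1)
        mask = target.reshape(-1) != self.ignore_label
        pixel_losses = pixel_losses[mask]
        # keep only the min_kept largest losses (hard example mining)
        k = min(self.min_kept, pixel_losses.numel())
        topk_losses, _ = torch.topk(pixel_losses, k)
        return topk_losses.mean()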

The contiguous() operation is most likely not the one failing; it is re-raising a sticky CUDA error from an earlier kernel, which also corrupts the CUDA context.
To debug this further, we would need an executable code snippet that reproduces the issue.
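
In the meantime, two things would help narrow it down. First, forcing synchronous kernel launches should make the stacktrace point at the operation that actually fails; a minimal way to do this (assuming it is set before CUDA is initialized) is at the very top of tools/train.py:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous launches, accurate stacktraces (debugging only)

Second, a standalone snippet along these lines, with your real shapes, num_classes, and loss settings filled in (the values below are placeholders), would be enough to reproduce on our side:

import torch

num_classes, ignore_label = 19, 255  # placeholders, use your dataset's values
score = torch.randn(2, num_classes, 128, 128, device="cuda", requires_grad=True)
target = torch.randint(0, num_classes, (2, 128, 128), device="cuda")
target[:, :8, :8] = ignore_label  # include some ignored pixels

criterion = OhemCrossEntropy(ignore_label=ignore_label, thres=0.7, min_kept=1000)
loss = criterion._ohem_forward(score, target)
loss.backward()
print(loss.item())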