CUDA error: an illegal memory access was encountered

Hi, all. I am getting a weird illegal memory access error whenever I try to train a FasterRCNN model with an image size of (1280,840,3) and a batch size of 3. The GPU used is a Tesla K80 with CUDA 10.1 on Ubuntu. I am using PyTorch 1.5 and torchvision 0.6. Given below is the code snippet.

def from_numpy_to_tensor(images,labels_list):

    images = torch.from_numpy(images).cuda()
    for label in labels_list:
        label["boxes"] = torch.from_numpy(label["boxes"]).cuda()
        label["labels"] = torch.from_numpy(label["labels"]).cuda()

    return images,labels_list

class CustomDataset(torch.utils.data.Dataset):

    def __init__(self,xtr,ytr):

        self.xtr = xtr
        self.ytr = ytr

    def __getitem__(self,idx):

        img = self.xtr[idx]
        tar = self.ytr[idx]

        return img,tar

    def __len__(self):

        return len(self.xtr)

def collate_fn(batch):
    return list(zip(*batch))

device = torch.device("cuda:0")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False,num_classes=4)
#model = nn.DataParallel(model,device_ids=[0,1,2,3])
optimizer = optim.Adam(model.parameters(),lr=0.000001)

x_train,y_train = from_numpy_to_tensor(images,labels)
dataset = CustomDataset(x_train,y_train)
dataloader = DataLoader(dataset,batch_size=3,collate_fn=collate_fn)

for i in range(epochs):
    logs = train_one_epoch(model,optimizer,dataloader,device,i,10)

The error I get is the following.

Traceback (most recent call last):
  File "", line 125, in <module>
    logs = train_one_epoch(model,optimizer,dataloader,device,i,10)
  File "/workspace/Pytorch tutorials/", line 46, in train_one_epoch
  File "/opt/conda/lib/python3.7/site-packages/torch/", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (launch_kernel at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/cuda/CUDALoops.cuh:112)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f685af93b5e in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #1: void at::native::gpu_index_kernel<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_put_kernel_impl<at::native::OpaqueType<4> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_put_kernel_impl<at::native::OpaqueType<4> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> const&) + 0x797 (0x7f685d77d227 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #2: <unknown function> + 0x25b9a64 (0x7f685d779a64 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #3: <unknown function> + 0xb610cf (0x7f6882cd60cf in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::ArrayRef<at::Tensor>, at::Tensor const&, bool, bool) + 0x491 (0x7f6882cd3901 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #5: <unknown function> + 0xee23de (0x7f68830573de in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #6: at::native::index_put_(at::Tensor&, c10::ArrayRef<at::Tensor>, at::Tensor const&, bool) + 0x135 (0x7f6882cc3255 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #7: <unknown function> + 0xee210e (0x7f688305710e in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #8: <unknown function> + 0x288fa88 (0x7f6884a04a88 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #9: torch::autograd::generated::IndexPutBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x251 (0x7f68847cf201 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #10: <unknown function> + 0x2ae8215 (0x7f6884c5d215 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #11: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f6884c5a513 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #12: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f6884c5b2f2 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #13: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f6884c53969 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #14: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f6887f9a558 in /opt/conda/lib/python3.7/site-packages/torch/lib/
frame #15: <unknown function> + 0xc819d (0x7f688a9fd19d in /opt/conda/lib/python3.7/site-packages/torch/lib/../../../.././
frame #16: <unknown function> + 0x76db (0x7f68a2f046db in /lib/x86_64-linux-gnu/
frame #17: clone + 0x3f (0x7f68a2c2d88f in /lib/x86_64-linux-gnu/

The train_one_epoch function is taken from here. I have already tried setting os.environ["CUDA_LAUNCH_BLOCKING"] = "1", but this hasn’t made any difference. Is there something wrong in this code?


I tried to reproduce this issue using this code snippet:

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).cuda()
images, boxes = torch.rand(4, 3, 840, 1280).cuda(), torch.rand(4, 11, 4).cuda()

labels = torch.randint(1, 91, (4, 11)).cuda()

images = list(image for image in images)
targets = []
for i in range(len(images)):
    d = {}
    d['boxes'] = boxes[i]
    d['labels'] = labels[i]
    targets.append(d)  # collect one target dict per image
output = model(images, targets)

which works fine on my machine.
Could you post an executable code snippet using random shapes, which triggers this issue, please?

I don’t know why, but when I run the code with random numbers as images of size (1280,840,3) and labels with random shapes, the error doesn’t come up. It runs flawlessly. When I run it with my dataset, it throws the illegal memory access error after 25-30 epochs. One thing I have observed is that when I reduce the image size to (640,640,3), the execution stops at the 57th-59th epoch with the aforementioned error. I am still not able to make out why this is happening.

Are you using the same image shape for all inputs?
Note that the expected image tensor shape is [batch_size, channels, height, width], but I assume you are already passing it in this shape.
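A quick sanity check along these lines could look like the sketch below. check_image_shapes is a hypothetical helper, not part of torchvision; it assumes each image exposes a .shape attribute (as tensors and NumPy arrays do) and that the images are RGB with 3 channels.

```python
from types import SimpleNamespace

def check_image_shapes(images, channels=3):
    """Return the indices of images that are not laid out as [channels, height, width]."""
    bad = []
    for idx, img in enumerate(images):
        shape = tuple(img.shape)
        if len(shape) != 3 or shape[0] != channels:
            bad.append(idx)
    return bad

# Stand-ins for real tensors: only the .shape attribute matters here.
chw = SimpleNamespace(shape=(3, 840, 1280))   # channels-first layout
hwc = SimpleNamespace(shape=(1280, 840, 3))   # channels-last, as often loaded from disk
bad_indices = check_image_shapes([chw, hwc])
print(bad_indices)
```

Any index it reports points at an image that probably still needs to be permuted to channels-first before it is passed to the model.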

Yes, the image shape is the same for all inputs. I am resizing all images to one size before sending them to training.

What could be the possible differences between the random inputs and your image data?
I.e. could your real image data take different paths inside the model?

I will try with some other dataset and check whether the same error is popping up.

I have one question. In the num_classes argument, do we have to include background as a class as well? Originally I have 4 classes. When I run with 5 classes (including background), it does not throw any error.

Yes, FasterRCNN expects the num_classes as:

num_classes (int): number of output classes of the model (including the background).
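So with 4 object classes labeled 1 to 4, num_classes would be 5. A check like the following sketch could catch the mismatch before training starts. find_out_of_range_labels is a hypothetical helper, and it assumes label 0 is reserved for background, so that valid target labels are 1..num_classes-1.

```python
def find_out_of_range_labels(targets, num_classes):
    """Return (target_index, label) pairs whose label falls outside 1..num_classes-1."""
    bad = []
    for idx, target in enumerate(targets):
        labels = target["labels"]
        # Tensors and NumPy arrays both provide .tolist(); plain lists pass through.
        if hasattr(labels, "tolist"):
            labels = labels.tolist()
        for label in labels:
            if not 1 <= int(label) < num_classes:
                bad.append((idx, int(label)))
    return bad

targets = [{"labels": [1, 2]}, {"labels": [4, 3]}]
bad_with_4 = find_out_of_range_labels(targets, num_classes=4)
bad_with_5 = find_out_of_range_labels(targets, num_classes=5)
print(bad_with_4, bad_with_5)
```

Running it once over the whole dataset is cheap compared to waiting for a crash many epochs in.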

That’s a good hint by the way. Maybe the illegal memory access is created in a custom CUDA extension used by this model, so we would need to dig into it.

How exactly did you set up the classes, so that we could try to reproduce it?

Actually, I tried to run the training entirely on CPU. When I started the training on CPU, I got the following error.

Traceback (most recent call last):
  File "", line 150, in <module>
    output = model(x_tr,y_tr)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/models/detection/", line 71, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/models/detection/", line 763, in forward
    class_logits, box_regression, labels, regression_targets)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/models/detection/", line 36, in fastrcnn_loss
    classification_loss = F.cross_entropy(class_logits, labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/", line 2317, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/", line 2115, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target 4 is out of bounds.
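The same failure can be shown in miniature without the model. The sketch below is a pure-Python stand-in for the loss, not the PyTorch implementation: with num_classes=4 there are 4 logits per proposal, indexed 0 to 3, so a target of 4 has no entry to index.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the logit at index `target` (toy, single sample)."""
    if not 0 <= target < len(logits):
        raise IndexError(f"Target {target} is out of bounds.")
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

four_class_logits = [0.0, 0.0, 0.0, 0.0]        # num_classes=4 -> indices 0..3
five_class_logits = [0.0, 0.0, 0.0, 0.0, 0.0]   # num_classes=5 -> indices 0..4

loss_ok = cross_entropy(five_class_logits, 4)   # valid: index 4 exists
try:
    cross_entropy(four_class_logits, 4)         # label 4 with only 4 logits
except IndexError as exc:
    print(exc)
```

On the CPU this kind of bounds check produces a readable exception, which is why rerunning on the CPU was useful here.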

That is when I corrected num_classes to 5 instead of 4, counting background as a class, which solved my problem.

I was using the function below to get boxes and labels

def get_boxes_and_labels(label_file,class_map):

    target = {}
    target["boxes"] = []
    target["labels"] = []
    for anno in label_file:

        if "tag" in anno.keys() and anno["tag"] in class_map:
            coord = []
            coord.append(coord[0] + anno["size"]["x"])
            coord.append(coord[1] + anno["size"]["y"])


    target["boxes"] = np.array(target["boxes"],dtype="float32")
    target["labels"] = np.array(target["labels"])
    return target

The class map was class_map = {"freezer":1,"Chiller":2,"SLAB":3,"Vegbox":4}.
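With this map the largest label is 4, so the model needs 5 outputs. A minimal sketch of deriving the value from the map instead of hard-coding it, assuming label 0 is left for background:

```python
class_map = {"freezer": 1, "Chiller": 2, "SLAB": 3, "Vegbox": 4}

# Label 0 is reserved for background, so the model needs max label + 1 outputs.
num_classes = max(class_map.values()) + 1
print(num_classes)
```

The result can then be passed as the num_classes argument when the model is built.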

Thanks for the follow-up! So indeed the illegal memory access might have been raised due to the missing background class.

Could you please post your PyTorch version?

Yeah sure. This is the entire env.

CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80
GPU 4: Tesla K80
GPU 5: Tesla K80
GPU 6: Tesla K80
GPU 7: Tesla K80

Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0a0+82fd1c8
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.1.243             h6bb024c_0  
[conda] mkl                       2020.0                      166  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.0.15           py37ha843d7b_0  
[conda] mkl_random                1.1.0            py37hd6b4f25_0  
[conda] numpy                     1.18.1           py37h4f9e942_0  
[conda] numpy-base                1.18.1           py37hde5b4d6_1  
[conda] pytorch                   1.5.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchvision               0.6.0                py37_cu101    pytorch

@ptrblck @chhaya_kumar_das But how do you explain that the error occurs after several epochs? All data points have been processed already at that point at least once without illegal memory error, right? I’m having this error too.

No idea buddy…I tried to search on it a bit but couldn’t find anything

I am facing a similar issue while training with large tensors. The behaviour is not deterministic, though. When I vary (i.e. reduce) the batch size and change the seed, the issue disappears in most cases. At first I thought it was due to insufficient memory capacity, but then I realized the two things are not related. I am basically stuck and wondering how to debug the code effectively. Any thoughts or suggestions? The cuda-memcheck tool hasn’t helped much.

Are you using a custom extension, or did you just run cuda-memcheck on the complete model?
Due to the asynchronous execution of CUDA code, you could rerun the code with CUDA_LAUNCH_BLOCKING=1 python args and post the stack trace here, so that we could have a look.
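If you prefer to set the variable from inside the script rather than on the command line, a minimal sketch is below. It assumes the assignment sits at the very top of the training script, before torch is imported; the variable is only likely to take effect if it is set before CUDA is initialized in the process.

```python
import os

# Set this before importing torch or touching the GPU; once CUDA has been
# initialized in the process, changing the variable is unlikely to help.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and anything that imports it) only after this line

launch_blocking = os.environ.get("CUDA_LAUNCH_BLOCKING")
print("CUDA_LAUNCH_BLOCKING =", launch_blocking)
```

With blocking launches the Python stack trace should point at the operation that actually failed rather than at a later, unrelated call.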

I just wanted to add that I was getting this error with torch 1.6 and CUDA 10.2. I upgraded torch to 1.7 and the error changed to one telling me that one tensor was on the CPU and the other on the GPU, which was easy to fix.

So it appears the crash might be happening somewhere in the error handling, which would usually display a meaningful error, but some CUDA/torch mismatch or bug crashes the process first.

I’m facing the same problem now. When I took a part of my dataset and trained on it, the error went away. I really can’t see what’s wrong with it.

If you are already using the latest stable release or the nightly binary, did you try to run the script with CUDA_LAUNCH_BLOCKING=1 as suggested, and could you post the stack trace?

My environment : cuda=10.1.243 cudnn=764 pytorch=1.3.1
The error disappeared when I changed the cudnn version to 765 .