RuntimeError: CUDA error: an illegal memory access was encountered

Hi, everyone!
I ran into a strange illegal memory access error. It happens randomly, without any regular pattern.
The code is really simple: it's PointNet for point cloud segmentation. I don't think there is anything wrong with the code.

import torch
import torch.nn as nn
import torch.nn.functional as F
import os
class InstanceSeg(nn.Module):
    def __init__(self, num_points=1024):
        super(InstanceSeg, self).__init__()

        self.num_points = num_points

        self.conv1 = nn.Conv1d(9, 64, 1)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.conv3 = nn.Conv1d(64, 64, 1)
        self.conv4 = nn.Conv1d(64, 128, 1)
        self.conv5 = nn.Conv1d(128, 1024, 1)
        self.conv6 = nn.Conv1d(1088, 512, 1)
        self.conv7 = nn.Conv1d(512, 256, 1)
        self.conv8 = nn.Conv1d(256, 128, 1)
        self.conv9 = nn.Conv1d(128, 128, 1)
        self.conv10 = nn.Conv1d(128, 2, 1)
        self.max_pool = nn.MaxPool1d(num_points)

    def forward(self, x):
        batch_size = x.size()[0] # (x has shape (batch_size, 9, num_points))

        out = F.relu(self.conv1(x)) # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv2(out)) # (shape: (batch_size, 64, num_points))
        point_features = out

        out = F.relu(self.conv3(out)) # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv4(out)) # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
        global_feature = self.max_pool(out) # (shape: (batch_size, 1024, 1))

        global_feature_repeated = global_feature.repeat(1, 1, self.num_points) # (shape: (batch_size, 1024, num_points))
        out = torch.cat([global_feature_repeated, point_features], 1) # (shape: (batch_size, 1024+64=1088, num_points))

        out = F.relu(self.conv6(out)) # (shape: (batch_size, 512, num_points))
        out = F.relu(self.conv7(out)) # (shape: (batch_size, 256, num_points))
        out = F.relu(self.conv8(out)) # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv9(out)) # (shape: (batch_size, 128, num_points))

        out = self.conv10(out) # (shape: (batch_size, 2, num_points))

        out = out.transpose(2,1).contiguous() # (shape: (batch_size, num_points, 2))
        out = F.log_softmax(out.view(-1, 2), dim=1) # (shape: (batch_size*num_points, 2))
        out = out.view(batch_size, self.num_points, 2) # (shape: (batch_size, num_points, 2))

        return out

Num = 0
network = InstanceSeg()
network.cuda()
while True:

    input0 = torch.randn(32, 3, 1024).cuda()
    input1 = torch.randn(32, 3, 1024).cuda()
    input2 = torch.randn(32, 3, 1024).cuda()
    input = torch.cat((input0, input1, input2), 1) # (shape: (32, 9, 1024))

    out = network(input)
    Num = Num + 1
    print(Num)

After a random number of steps, the error is raised. The error report is:

Traceback (most recent call last):
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 58, in <module>
    input0 = torch.randn(32, 3, 1024).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

When I added os.environ['CUDA_LAUNCH_BLOCKING'] = '1' at the top of this script, the error report changed to this:

Traceback (most recent call last):
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 64, in <module>
    out = network(input)
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 35, in forward
    out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 187, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
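
For reference, the variable has to be in the environment before the first CUDA call, so I put it at the very top of the script:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1' # forces synchronous kernel launches, so the failing op shows up in the trace

import torch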

I know that wrong indexing operations and incorrect usage of loss functions can lead to illegal memory access errors, but there is no such operation in this script.
I am quite sure this error is not caused by running out of memory, since only about 2 GB of GPU memory is used and I have 12 GB in total.
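
For reference, the PyTorch-side usage can also be printed from inside the script (note these counters only track tensors allocated by PyTorch itself, so nvidia-smi will report somewhat more):

print(torch.cuda.memory_allocated() / 1024**3, "GB currently allocated")
print(torch.cuda.max_memory_allocated() / 1024**3, "GB peak allocation")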

This is my environment information:

OS: Ubuntu 16.04 LTS 64-bit
GPU: Titan XP
Driver Version: 410.93
Python Version: 3.6
CUDA Version: cuda_9.0.176_384.81_linux
cuDNN Version: cudnn-9.0-linux-x64-v7.4.2.24
PyTorch Version: pytorch-1.0.1-py3.6_cuda9.0.176_cudnn7.4.2_2

I have been stuck here for a long time. I don't think there is anything wrong with the code, since it runs correctly for some steps. Maybe this error is caused by the environment; I am not sure.
Does anyone have any idea about this situation? If more detailed information is needed, please let me know.
Thanks for any suggestions.

Does your code run fine on the CPU?
Would it be possible to test it in a reasonable amount of time?
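
E.g. a quick sanity check could look like this (just drop the .cuda() calls from your script):

network = InstanceSeg()
for step in range(1000): # arbitrary number of iterations
    x = torch.randn(32, 9, 1024)
    out = network(x)
    print(step)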

Hi, ptrblck!
Thanks for your reply.
I have run my code on the CPU for about half an hour, and no error happened.

Thanks for the information!
Do you have any other GPU available to test it against your Titan XP?

Recently, @pinouchon reported similar issues with his GPU in this topic. In the end he realized that the hardware was broken (maybe by the previous owner).

I don't have another GPU. I am not sure if my GPU is broken; is there any way to diagnose it? The graphics card outputs video correctly.
I have a Windows 10 system on the same computer.
Its environment information is:

OS: Windows 10 64-bit
Driver Version: 411.31
CUDA Version: 10.0
cuDNN Version: 7.4
PyTorch Version: 1.0.1

But this error doesn't seem to occur in that environment, so I don't think it is caused by the hardware.
Is there any possibility that this error is caused by incompatible software versions? Maybe the driver, CUDA, or cuDNN version?

That's good to know.
Could you install PyTorch with CUDA10 and check it again?
The Pascal architecture should also work with CUDA9, but it looks like CUDA10 is running fine on your Windows OS.
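
E.g. via conda, the CUDA10 binaries can be installed with something like this (the exact package spec depends on the release you pick):

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch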

I had a similar issue with a TCN network: semi-random CUDA errors after a random number of steps.
I probably have a bad GPU, though I'm still not 100% sure. I'd say I'm 90% sure the GPU is faulty, but it can still train RNN models fine…

My suggestions:

  1. If you find a way to trigger the error faster (ideally on the first batch), that would help tremendously with debugging. In the training runs where the error happened sooner, what did you change specifically (batch size, regularisation, architecture)? Try fiddling with those params to get the error to trigger faster.
  2. If you have the time/know-how, you could replicate your setup on Google Cloud (or Colab) by renting a GPU.
  3. Try simplifying your architecture (removing all convs, removing max pooling, removing all relus) and see if the error goes away. The caveat is that without point 1, each run will take a long time.
  4. Create a "reproduction dump": a gist or repo with instructions, so that it's easier to find someone willing to reproduce (or not) your issue.
  5. Try giving the network super small inputs, and then super large inputs that barely fit on your GPU, and see if the timing of the error changes. If it does, the error is more likely linked to a memory issue. Monitor the run with nvidia-smi -l 1. A minimal sketch of this is below the list.
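
A minimal sketch of point 5, reusing the InstanceSeg model from the first post (the sizes below are arbitrary examples, not values I have tested):

import torch

# Try a tiny input and a near-capacity input and compare when/if the error triggers.
for batch_size, num_points in [(2, 64), (128, 8192)]:
    network = InstanceSeg(num_points=num_points).cuda()
    x = torch.randn(batch_size, 9, num_points).cuda()
    out = network(x) # forward pass only; watch nvidia-smi -l 1 in another terminal
    print(batch_size, num_points,
          torch.cuda.max_memory_allocated() / 1024**2, "MB peak allocated")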

@pinouchon’s suggestions are nice and could narrow down the issue.

@NightRainXiaoxiang could you also run FurMark as a stress test for your GPU?

@ptrblck @pinouchon Thanks for your advice, but I may not have time for such a time-consuming test, and I don't have another GPU available for my task. I will first try newer software versions (PyTorch 1.1, CUDA 10.0, cuDNN 7.6) to see if the error still happens. It seems that CUDA 10.0 works well, but I am not sure. I will post more information here after I do more trials. I hope this information helps to solve this kind of error, since it is very hard to find and debug.
