CUDA Illegal Memory Access

chinmay5 · June 12, 2020, 7:11am

While trying to implement a backward pass, I keep getting a CUDA illegal memory access error.

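@staticmethod
def backward(ctx, grad_output):
    grad_label = grad_output.clone()
    num_ft = grad_output.shape[0]
    # grad_label.data.resize_(num_ft, 32, 41)
    # ctx.saved_variables is deprecated; ctx.saved_tensors returns the same tensors
    lin_indices_3d, lin_indices_2d = ctx.saved_tensors
    # The first element holds the number of valid indices that follow it
    num_ind = lin_indices_3d.data[0]
    # Gather the columns listed in lin_indices_3d and write them to the
    # column positions listed in lin_indices_2d
    grad_label.data.view(num_ft, -1).index_copy_(
        1, lin_indices_2d.data[1:1 + num_ind],
        torch.index_select(grad_output.data.contiguous().view(num_ft, -1),
                           1, lin_indices_3d.data[1:1 + num_ind]))
    # raw_input('sdflkj')
    return grad_label, None, None, None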

I tried using pdb to see what the possible cause might be.

I am not sure what is wrong in the implementation here. Any help would be highly appreciated.

I am using PyTorch 1.3, but the same error persists on 1.4 and 1.5.

tr_arun · June 12, 2020, 7:38am

Hi,

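An illegal memory access error occurs when your program tries to access a memory location that it does not have permission to access.

CUDA kernels are launched asynchronously, so the stack trace may not point at the operation that actually failed. By setting CUDA_LAUNCH_BLOCKING=1, you make the launches synchronous and can see where the error comes from.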
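As a minimal sketch of the suggestion above: the variable can be set from inside the script instead of on the command line. This assumes it is set before CUDA is initialized, so the safest place is before torch is imported; the commented-out lines are placeholders for your own code.

```python
import os

# Set this before CUDA is initialized; doing it before `import torch`
# is the safest option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable has been set
# ... run the failing forward/backward pass here ...
```

With synchronous launches, the Python stack trace should point at the call that triggered the illegal access. The program will run more slowly, so this is meant for debugging only.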

chinmay5 · June 12, 2020, 7:42am

When I run with CUDA_LAUNCH_BLOCKING=1, I get the error

Any idea what could be the reason?

ptrblck · June 12, 2020, 9:47am

Could you install the nightly binary (in a new virtual environment) and rerun the code?
If you are still running into this error, could you post a code snippet to reproduce this issue, please?

chinmay5 · June 12, 2020, 9:54am

@ptrblck I will start on that. I have another question, though. Digging around, it seems that the issue was not present in PyTorch 1.2. I thought of downgrading to PyTorch 1.2, but as soon as I do that, I get an error of

For one of the PyBind modules. This error was not there when I worked with PyTorch 1.3 and above. Any ideas about this?

ptrblck · June 12, 2020, 10:04am

You might try to use input_.data<scalar_t>(), but I would recommend sticking to the latest version instead of downgrading.

chinmay5 · June 12, 2020, 11:08am

@ptrblck Same error. What is the general reason for this error? I want to double-check that it is not something related to the input data before I open a new issue.
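One way to rule out the input data is to check that every index passed to index_select and index_copy_ lies inside the dimension being indexed, since out-of-range indices are a common cause of illegal memory accesses on the GPU. The sketch below is only an illustration: check_index_range is a hypothetical helper, not part of PyTorch, and it treats negative values as invalid, which is a conservative assumption. The 32 * 41 size is taken from the commented-out resize_ call in the original post and may not match the real shapes.

```python
def check_index_range(indices, size, name="indices"):
    """Raise IndexError if any index lies outside [0, size)."""
    # .tolist() copies a tensor (CPU or CUDA) into a plain Python list;
    # plain lists and tuples are used as they are.
    values = indices.tolist() if hasattr(indices, "tolist") else list(indices)
    if not values:
        return
    lo, hi = min(values), max(values)
    if lo < 0 or hi >= size:
        raise IndexError(
            f"{name}: values span [{lo}, {hi}], "
            f"but the valid range is [0, {size - 1}]"
        )


# Plain-list demonstration with 32 * 41 = 1312 flattened positions per row.
check_index_range([0, 5, 1311], 32 * 41)      # in range, passes silently
try:
    check_index_range([0, 5, 1312], 32 * 41)  # one past the end
except IndexError as err:
    print(err)
```

In the backward pass, a call such as check_index_range(lin_indices_3d[1:1 + num_ind], grad_output.view(num_ft, -1).shape[1], "lin_indices_3d") placed before the index_copy_ line would raise a readable Python error instead of crashing inside a CUDA kernel. Running the same code on the CPU is another way to get a clearer message.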

chinmay5 · June 12, 2020, 1:52pm

@ptrblck I read in one of your previous posts that this error can occur when the input tensor is not contiguous, so I make sure the input is contiguous by calling .contiguous() on it. At this point, I have no idea what could be wrong.

chinmay5 · June 12, 2020, 6:15pm

@ptrblck It works with the nightly version. Thank you so much for your support.

ptrblck · June 13, 2020, 2:32am

I think the syntax just changed after 1.3, as the PyBind error seems to be a compilation error.
Good to know it’s working with the nightly.