Illegal memory access error for simple function

John_Deterious · January 17, 2020, 4:09am

I have this simple function that throws an error whenever any tensor on the GPU is passed to it.


import torch

def sumlogC(y, eps=1e-5):
    '''
    Numerically stable implementation of the
    sum of the logarithm of the Continuous Bernoulli
    constant C, using a 2nd-degree Taylor approximation

    Parameter
    ----------
    y : Tensor of dimensions (batch_size, dim)
        y takes values in (0,1)
    '''
    # Keep values away from 0 and 1 so the logs below stay finite
    x = torch.clamp(y, eps, 1. - eps)
    # Entries within eps of 0.5 use the Taylor branch, since (1 - 2x) -> 0 there
    mask = torch.abs(x - 0.5).ge(eps)
    far = torch.masked_select(x, mask)
    close = torch.masked_select(x, ~mask)
    far_values = torch.log((torch.log(1. - far) - torch.log(far)).div(1. - 2. * far))
    close_values = torch.log(torch.tensor(2.)) + torch.log(1. + torch.pow(1. - 2. * close, 2) / 3.)
    return far_values.sum() + close_values.sum()
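
For readers following along, the two branches of the function can be written out as below. This assumes the standard definition of the continuous Bernoulli normalizing constant; the symbols λ (one entry of the clamped input) and ε (the `eps` argument) are notation added here, not the poster's.

```latex
\log C(\lambda) =
\begin{cases}
\log \dfrac{\log(1-\lambda) - \log\lambda}{1 - 2\lambda},
  & \left|\lambda - \tfrac{1}{2}\right| \ge \varepsilon \\[2ex]
\log 2 + \log\!\left(1 + \dfrac{(1 - 2\lambda)^2}{3}\right),
  & \left|\lambda - \tfrac{1}{2}\right| < \varepsilon
\end{cases}
```

The second branch follows from $\log(1-\lambda) - \log\lambda = 2\,\operatorname{artanh}(z)$ with $z = 1 - 2\lambda$, together with the expansion $2\,\operatorname{artanh}(z)/z = 2\left(1 + z^2/3 + O(z^4)\right)$.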

ptrblck · January 17, 2020, 9:10am

The code snippet works for multiple runs of:

y = torch.empty(10, 10).uniform_().cuda()
sumlogC(y)

Could you share some input, which creates the error?
Also, which PyTorch version are you using?

John_Deterious · January 17, 2020, 9:54am

I have just upgraded to the latest version, 1.4. Now it runs smoothly in a for loop and then suddenly shows this error again: sometimes after 10 batches, sometimes after 400. The behaviour is completely unpredictable, with everything else held fixed.

What I have is a vanilla VAE that works smoothly and gives good results. I then added the output of the function above to the ELBO loss (according to a 2019 paper); that’s the only change I made. After that I started getting the CUDA error.

Once it happens, the entire GPU becomes inaccessible: nothing can be moved there, and nothing already there can be read. Here’s the error I get when I ask for a tensor stored on the GPU after the failure:

Traceback (most recent call last):
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\IPython\core\formatters.py", line 224, in catch_format_error
    r = method(self, *args, **kwargs)
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\IPython\core\formatters.py", line 702, in __call__
    printer.pretty(obj)
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\IPython\lib\pretty.py", line 402, in pretty
    return _repr_pprint(obj, self, cycle)
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\IPython\lib\pretty.py", line 697, in _repr_pprint
    output = repr(obj)
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\tensor.py", line 159, in __repr__
    return torch._tensor_str._str(self)
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 311, in _str
    tensor_str = _tensor_str(self, indent)
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 209, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 242, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 242, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 244, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in self])
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 244, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in self])
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 242, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 242, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "C:\Users\s4551072\.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 235, in get_summarized_data
    return torch.cat((self[:PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems:]))
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at …\aten\src\THC\THCCachingHostAllocator.cpp:278
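
A note for anyone debugging a similar trace: CUDA kernels are launched asynchronously, so an illegal memory access is often reported at a later, unrelated call (here, printing a tensor) rather than at the operation that caused it. Below is a minimal sketch of how to get a stack trace that points closer to the faulty operation. It assumes the environment variable can be set before CUDA is initialized, and the `loss` line is only a stand-in for the real training step.

```python
import os

# Must be set before CUDA is initialized; setting it before importing torch is the safe option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

y = torch.empty(10, 10, device=device).uniform_()
# Stand-in for the real training step (not the poster's loss).
loss = torch.log(y.clamp(1e-5, 1 - 1e-5)).sum()

# With blocking launches, a faulty kernel raises at the line that launched it.
# An explicit synchronize is another way to surface pending errors at a known point.
if device.type == "cuda":
    torch.cuda.synchronize()

print(loss.item())
```

Blocking launches slow training down noticeably, so this is something to switch on only while hunting the bug.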

John_Deterious · January 17, 2020, 3:12pm

Hi @ptrblck, I bypassed the problem with a band-aid solution: send the tensor to the CPU, run the function, and send the result back to the GPU.
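
The workaround described above could look roughly like the sketch below. The helper name `run_on_cpu` and the toy function passed to it are illustrative assumptions, not the code actually used in the training loop.

```python
import torch

def run_on_cpu(fn, tensor):
    """Evaluate fn on a CPU copy of tensor and return the result on tensor's original device."""
    result = fn(tensor.cpu())
    return result.to(tensor.device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
y = torch.rand(4, 3, device=device)

# Toy stand-in for the real function; any callable taking a tensor works here.
out = run_on_cpu(lambda t: torch.log(t.clamp(1e-5, 1 - 1e-5)).sum(), y)
print(out.device)
```

Both `.cpu()` and `.to()` are tracked by autograd, so gradients still flow back to the GPU tensor. The price is a host/device copy on every call.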

It’s an annoying issue with no obvious solution on the internet. I can work on code that reproduces the issue if that is useful.

John_Deterious · January 17, 2020, 3:59pm

Hi @ptrblck, I completely solved the problem: I applied a sigmoid to the input of the function, since the input is indeed intended to be in the [0, 1] range. Still, even with this mistake in place, the error makes no sense and is unpredictable. On the CPU you never see such a thing.
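
A minimal sketch of that fix, assuming the function was previously receiving raw, unbounded decoder outputs (the post implies this but does not show the model code). The `logits` values are made up for illustration.

```python
import torch

# Hypothetical raw decoder outputs, including extreme values.
logits = torch.tensor([[-300.0, -1.0, 0.0],
                       [2.5, 40.0, 300.0]])

# The sigmoid maps every real number into [0, 1]; very large magnitudes saturate to 0 or 1 in float32.
probs = torch.sigmoid(logits)

print(probs.min().item(), probs.max().item())
```

The result can then be passed to the function, whose internal clamp keeps the values away from exactly 0 and 1.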

Thank you for your time.

ptrblck · January 17, 2020, 6:56pm

I cannot reproduce this error with 10000 runs using values in [-300, 300].
Do you have a script to reproduce this issue?

JVGD · March 30, 2020, 7:54am

Hi! I had exactly the same problem when running inference with my torch model converted to a TensorRT model. @John_Deterious gave me two hints to solve it:

Make sure the input is in the range your model expects
Make sure your input is on the same device as your model

This is what I had:

model_input = torch.rand((1, 3, 416, 416))
y = model(model_input)
model_trt = to_trt(model, model_input)
y_trt = model_trt(model_input)

But my model expects a normalized input in the range [0, 1], and model_trt was on device='cuda:0'. So I changed the input tensor to be in the expected range and on the GPU (the same device):

model_input = torch.randn((1, 3, 416, 416)).clamp(0, 1).cuda()
y = model(model_input)
model_trt = parse_model_trt(model, model_input)
y_trt = model_trt(model_input)
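
The two checks from this post (input range and input device) can be sketched as a small helper that fails early with a readable message instead of a CUDA error. The name `check_input` and its default bounds are hypothetical, and the helper assumes a plain `nn.Module` with at least one parameter; whether a TensorRT-wrapped model exposes parameters the same way is not something this thread confirms.

```python
import torch
from torch import nn

def check_input(model: nn.Module, x: torch.Tensor,
                low: float = 0.0, high: float = 1.0) -> None:
    """Raise ValueError if x is on another device than the model or outside [low, high]."""
    param = next(model.parameters())  # assumes the model has at least one parameter
    if x.device != param.device:
        raise ValueError(f"input is on {x.device}, model is on {param.device}")
    if not torch.isfinite(x).all():
        raise ValueError("input contains NaN or inf")
    if x.min() < low or x.max() > high:
        raise ValueError(
            f"input range [{x.min().item():.3g}, {x.max().item():.3g}] "
            f"is outside [{low}, {high}]"
        )

model = nn.Linear(3, 2)
check_input(model, torch.rand(4, 3))  # passes: values in [0, 1), same device as the model

try:
    check_input(model, 10 * torch.ones(4, 3))  # out of range
except ValueError as err:
    print("rejected:", err)
```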