What may cause a RuntimeError in grid_sample?

Hi,

I am using PyTorch 1.0.0 and 1.0.1.post2 on Ubuntu 16.04.

I use 5D tensors with grid_sample; my code looks like this:

import torch.nn.functional as F

for data in dataloader:
    heatmap = model(data)
    grid_flow = f(data)
    new_heatmap = F.grid_sample(heatmap, grid_flow)

This code runs smoothly on most of the data, but on certain samples (3 out of about 1500) it causes:

Traceback (most recent call last):
  File "..........py", line 401, in <module>
    cam_R = meta['camera_R'].to(dev)
RuntimeError: CUDA error: an illegal memory access was encountered

I have already set CUDA_LAUNCH_BLOCKING to '1', but it still does not pinpoint where the error happens.

I also tried commenting out new_heatmap = F.grid_sample(heatmap, grid_flow), and then there is no error.
So I am fairly sure the error occurs inside F.grid_sample.
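One way to isolate the few failing batches is to wrap the call and dump its inputs when it raises. This is only a sketch of that idea, not the original code; grid_sample_checked and failing_batch.pkl are hypothetical names, and note that after an illegal memory access the CUDA context may be unusable, so this works best together with CUDA_LAUNCH_BLOCKING=1:

```python
import pickle

import torch
import torch.nn.functional as F


def grid_sample_checked(heatmap, grid_flow, dump_path="failing_batch.pkl"):
    """Call grid_sample; if it raises, pickle CPU copies of the inputs
    so the offending batch can be reproduced offline."""
    try:
        return F.grid_sample(heatmap, grid_flow)
    except RuntimeError:
        with open(dump_path, "wb") as f:
            pickle.dump({"heatmap": heatmap.detach().cpu(),
                         "grid_flow": grid_flow.detach().cpu()}, f)
        raise
```

On the good batches this behaves exactly like a plain grid_sample call.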


Update 1:
I changed the grid_sample mode to nearest, new_heatmap = F.grid_sample(heatmap, grid_flow, mode='nearest'), and there is no runtime error.


I just cannot determine what causes this runtime error.

Any idea or help will be appreciated!

Thanks,
Zhe


Do you get the same error if you run your code on the CPU?
The error message might be a bit clearer then.

Thanks for your advice. I ran grid_sample in CPU mode with

new_heatmap = F.grid_sample(heatmap.cpu(), grid_flow.cpu())

It does not cause any error on the same data that triggers the CUDA runtime error.
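That CPU check can be sketched as a small helper (cpu_repro is a hypothetical name) that re-runs the call on CPU copies, where the kernels usually raise clearer errors, and at the same time reports the grid's value range, since out-of-range or non-finite grid entries are prime suspects for a CUDA-only crash:

```python
import torch
import torch.nn.functional as F


def cpu_repro(heatmap, grid_flow):
    """Re-run grid_sample on CPU copies of tensors that crash on the GPU.
    If it succeeds here, the problem likely lies in the CUDA kernel or
    in values the CUDA kernel mishandles."""
    g = grid_flow.detach().cpu()
    # Grid values are expected in [-1, 1]; report anything suspicious.
    print("grid min/max:", g.min().item(), g.max().item(),
          "all finite:", bool(torch.isfinite(g).all()))
    return F.grid_sample(heatmap.detach().cpu(), g)
```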

This might be caused by running out of memory.
I ran the code in debug mode, stepping through each line; the GPU memory usage starts to increase and eventually results in

find_frame: frame not found.
    Looking for thread_id:pid_20017_id_140663683239608, frame_id:140663360355304
    Current     thread_id:pid_20017_id_140663683239608, available frames:
    94804309553960  -  94804309571304  -  140648434923752  -  140663064649256  -  94804081775976


RuntimeError: CUDA out of memory. Tried to allocate 600.00 MiB (GPU 0; 10.91 GiB total capacity; 9.78 GiB already allocated; 352.44 MiB free; 205.70 MiB cached)

But it is still strange that, in run mode, the error happens exactly in grid_sample().
With the help of torch.cuda.memory_allocated(dev) and torch.cuda.memory_cached(dev), I can see that the allocated and cached memory are 2,733,509,632 and 5,054,136,320 bytes just before grid_sample executes; then the error RuntimeError: CUDA error: an illegal memory access was encountered occurs.
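That memory check can be wrapped in a small helper for logging around the suspect call. This is a hypothetical sketch (report_cuda_memory is not from the thread); note that torch.cuda.memory_cached was later renamed to memory_reserved in newer PyTorch versions:

```python
import torch


def report_cuda_memory(tag=""):
    """Print allocated and reserved (cached) CUDA memory in bytes.
    No-op on machines without a GPU."""
    if torch.cuda.is_available():
        print(tag,
              "allocated:", torch.cuda.memory_allocated(),
              "reserved:", torch.cuda.memory_reserved())
```

Calling it immediately before and after grid_sample makes it easy to see whether the allocator state changes around the failing kernel launch.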

Still, if I use mode='nearest' in grid_sample(), no error occurs.

Thanks for the debugging!
Do you see this error, if you lower the memory usage, e.g. with a smaller batch size?

Yep,
the original batch size is 2, and the error occurs at iteration 422.
Now I have decreased the batch size to 1, and the error occurs at iteration 844.

I set a breakpoint just before grid_sample in iteration 844; at the breakpoint, GPU memory usage is 3083 MB.
After execution of grid_sample(),

844it [05:01,  3.73it/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=77 : an illegal memory access was encountered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered

and the memory usage is also 3083 MB according to nvidia-smi.


https://1drv.ms/f/s!Ai2nc20bhVzugYI_SABdx2bYq-tp7w

I pickled the heatmap and grid and wrote code to reproduce it.

I ran it in PyCharm with the option below ticked.

It throws an error:

torch.Size([4, 1, 16, 100, 4096])
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=238 error=77 : an illegal memory access was encountered

But if I run it without the option ticked, the error message is not shown. It is also not shown in debug mode or with python xxx.py.
I am not sure whether there is no error or it is just not shown.

I think the error you’re getting might be related to this issue

If so, the problem probably occurs whenever your predicted grid happens to contain an infinite or NaN value. Perhaps your model’s predictions blow up during training, or you are dividing somewhere by a value that can sometimes be 0?

We will hopefully be able to fix grid_sample to handle these values without crashing, but in the meantime my best recommendation for a workaround is to check your grid for any NaN or infinite values (perhaps using torch.isfinite) and replace them with some stand-in value before passing the grid to grid_sample.
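A minimal sketch of that workaround (sanitize_grid is a hypothetical helper; the fill value 0.0 maps to the image centre in grid_sample's normalised [-1, 1] coordinates):

```python
import torch
import torch.nn.functional as F


def sanitize_grid(grid, fill=0.0):
    """Replace NaN/Inf entries in a sampling grid with `fill` before
    passing it to grid_sample."""
    finite = torch.isfinite(grid)
    if not bool(finite.all()):
        grid = torch.where(finite, grid, torch.full_like(grid, fill))
    return grid
```

Usage would be new_heatmap = F.grid_sample(heatmap, sanitize_grid(grid_flow)); the torch.where only runs when bad values are actually present.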


I am getting this error when the grid values are out of range, on CUDA but not on CPU.

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)
flow = torch.nn.Tanh()(torch.randn(1, 64, 64, 2))
pred_x = F.grid_sample(x, flow)  # no problem
pred_x = F.grid_sample(x, 2 * flow)  # no problem
pred_x = F.grid_sample(x.to(torch.device('cuda')), flow.to(torch.device('cuda')))  # no problem
pred_x = F.grid_sample(x.to(torch.device('cuda')), 2 * flow.to(torch.device('cuda')))
# RuntimeError: CUDA error: an illegal memory access was encountered

I am using PyTorch 1.4.0.
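On the affected versions, one defensive workaround for this repro is to clamp the grid back into [-1, 1] before sampling. This is only a sketch: out-of-range grid coordinates are normally legal (they fall into padding, per padding_mode), so clamping just sidesteps the out-of-range code path rather than fixing the kernel bug, and it changes which pixels out-of-range locations sample:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)
flow = torch.nn.Tanh()(torch.randn(1, 64, 64, 2))

# 2 * flow can leave [-1, 1]; clamping keeps every sampling location
# inside the image, so the out-of-range path is never taken.
safe_flow = (2 * flow).clamp(-1.0, 1.0)
pred_x = F.grid_sample(x, safe_flow)
```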

I cannot reproduce this issue in the latest nightly.
Could you update to the nightly binary (use a new conda environment, if necessary) and rerun the code, please?

Funnily enough, I am not able to reproduce this in stable or nightly any more.

I first encountered this while training a model to predict flow and warp an image as above. At one point it gave me the “illegal memory access” error, and that error persisted in later commands with CUDA tensors. Even after I restarted the Python environment, data transfer from CPU to GPU (using .to(torch.device('cuda'))) was extremely slow, and my screen display kept freezing (the same GPU drives the display on my laptop). I have rebooted my laptop, and now things are working fine again. (Just to put the information on record.)

Still not sure why the “illegal memory access” error was encountered, though. Do you think it might be because it encountered a NaN or Inf value in flow? Thoughts?

This might be the case, but I’m also unable to reproduce it using NaNs for x and flow.
A NaN value should not cause an illegal memory access, but should rather yield a proper error message, so please let us know if you are able to reproduce this behavior, and we will fix it.