What may cause a RuntimeError in grid_sample?


#1

Hi,

I am using PyTorch 1.0.0 and 1.0.1.post2 on Ubuntu 16.04.

I use a 5D tensor for grid_sample; my code is like below:

import torch.nn.functional as F

for data in dataloader:
    heatmap = model(data)                            # 5D tensor: (N, C, D, H, W)
    grid_flow = f(data)                              # sampling grid: (N, D_out, H_out, W_out, 3)
    new_heatmap = F.grid_sample(heatmap, grid_flow)

This code runs smoothly on most of the data, but on certain samples (3 out of about 1500) it raises:

Traceback (most recent call last):
  File "..........py", line 401, in <module>
    cam_R = meta['camera_R'].to(dev)
RuntimeError: CUDA error: an illegal memory access was encountered

I have already set CUDA_LAUNCH_BLOCKING to '1', but the traceback still does not point to the place where the error actually happens.
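
For reference, CUDA_LAUNCH_BLOCKING only takes effect if it is set before the first CUDA call; a minimal sketch of one way to make sure of that:

import os
# Must be set before CUDA is initialized (i.e. before the first CUDA call),
# otherwise kernels keep launching asynchronously and the traceback points
# at an unrelated line.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch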

I also tried commenting out new_heatmap = F.grid_sample(heatmap, grid_flow), and then there is no error,
so I am sure the error occurs in F.grid_sample.


Update 1:
If I change the grid_sample mode to nearest, i.e. new_heatmap = F.grid_sample(heatmap, grid_flow, mode='nearest'), there is no runtime error.
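
One thing worth ruling out here: bilinear mode reads several neighboring voxels per output location, while nearest reads only one, so a NaN or wildly out-of-range value in the grid could plausibly turn into a bad memory access only in bilinear mode. A minimal sanity check along those lines, run just before the call (tensor names as in the first snippet; the ±2 threshold is only an illustrative choice, since grid coordinates are expected to be normalized to [-1, 1]):

import torch
import torch.nn.functional as F

def check_grid(grid):
    # grid_sample expects sampling coordinates normalized to [-1, 1]
    if torch.isnan(grid).any():
        raise ValueError('grid contains NaN')
    lo, hi = grid.min().item(), grid.max().item()
    if lo < -2.0 or hi > 2.0:
        # the threshold of 2 is only illustrative; Inf also lands here
        print('suspicious grid range: [%g, %g]' % (lo, hi))

check_grid(grid_flow)
new_heatmap = F.grid_sample(heatmap, grid_flow)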


I just cannot figure out what might be causing this runtime error.

Any ideas or help would be appreciated!

Thanks,
Zhe


#2

Do you get the same error if you run your code on the CPU?
The error message might be a bit clearer then.


#3

Thanks for your advice. I ran grid_sample in CPU mode with

new_heatmap = F.grid_sample(heatmap.cpu(), grid_flow.cpu())

and it does not cause any error on the same data that triggers the CUDA runtime error.


#4

This might be caused by running out of memory.
I ran the code in debug mode, stepping through each line; the GPU memory usage starts to increase and eventually results in:

find_frame: frame not found.
    Looking for thread_id:pid_20017_id_140663683239608, frame_id:140663360355304
    Current     thread_id:pid_20017_id_140663683239608, available frames:
    94804309553960  -  94804309571304  -  140648434923752  -  140663064649256  -  94804081775976

RuntimeError: CUDA out of memory. Tried to allocate 600.00 MiB (GPU 0; 10.91 GiB total capacity; 9.78 GiB already allocated; 352.44 MiB free; 205.70 MiB cached)

But it is still weird that when I run the code normally (run mode instead of debug mode), the error happens exactly in grid_sample().
With the help of torch.cuda.memory_allocated(dev) and torch.cuda.memory_cached(dev), I can see that the allocated and cached memory are 2,733,509,632 and 5,054,136,320 bytes just before the execution of grid_sample; then the error RuntimeError: CUDA error: an illegal memory access was encountered occurs.
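
For anyone who wants to reproduce this kind of bookkeeping, a sketch (dev, heatmap and grid_flow as in the earlier snippets):

import torch
import torch.nn.functional as F

def log_cuda_mem(tag, dev):
    # bytes currently held by tensors vs. bytes held by the caching allocator
    print('%s: allocated=%d cached=%d'
          % (tag, torch.cuda.memory_allocated(dev), torch.cuda.memory_cached(dev)))

log_cuda_mem('before grid_sample', dev)
new_heatmap = F.grid_sample(heatmap, grid_flow)
log_cuda_mem('after grid_sample', dev)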

Still, if I use mode='nearest' in grid_sample(), no error occurs.


#5

Thanks for the debugging!
Do you see this error if you lower the memory usage, e.g. with a smaller batch size?


#6

Yep,
the original batch size is 2, and the error occurs at iteration 422.
Now I have decreased the batch size to 1, and the error occurs at iteration 844 (422 × 2 = 844, so it is presumably the same data sample that triggers it).

I set a breakpoint just before grid_sample in iteration 844; at the breakpoint, the GPU memory usage is 3083 MB.
After the execution of grid_sample(),

844it [05:01,  3.73it/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=77 : an illegal memory access was encountered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered

and the memory usage reported by nvidia-smi is still 3083 MB.


#7

https://1drv.ms/f/s!Ai2nc20bhVzugYI_SABdx2bYq-tp7w

I pickled the heatmap and grid, and wrote code to reproduce it.
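
A reproduction script along these lines (the pickle file names are placeholders for the files linked above):

import pickle

import torch
import torch.nn.functional as F

# Load the saved inputs; file names are placeholders.
with open('heatmap.pkl', 'rb') as fh:
    heatmap = pickle.load(fh).cuda()
with open('grid.pkl', 'rb') as fh:
    grid_flow = pickle.load(fh).cuda()

print(heatmap.size())
new_heatmap = F.grid_sample(heatmap, grid_flow)  # triggers the illegal access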

I ran it in PyCharm with the option below ticked (screenshot omitted), and it throws the following error:

torch.Size([4, 1, 16, 100, 4096])
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=238 error=77 : an illegal memory access was encountered

But if I run it without that option ticked, the error message is not shown. It is also not shown in debug mode or when running it with python xxx.py.
I am not sure whether no error occurs or the message is just not shown.
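
One likely explanation for the message appearing only sometimes: CUDA kernels are launched asynchronously, so the illegal access is reported at whatever later operation happens to synchronize with the GPU, and if nothing synchronizes before the process exits, the message can be lost entirely. Forcing a synchronization right after the call should make it deterministic (a sketch, using the tensors from the repro above):

import torch
import torch.nn.functional as F

new_heatmap = F.grid_sample(heatmap, grid_flow)
# Block until the kernel has actually run; an illegal access is then
# raised here rather than at some unrelated later CUDA call (or never).
torch.cuda.synchronize()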