I am using PyTorch 1.0.0 and 1.0.1.post2 on Ubuntu 16.04.
I use a 5D tensor with grid_sample; my code looks like this:
```python
for data in dataloader:
    heatmap = model(data)
    grid_flow = f(data)
    new_heatmap = F.grid_sample(heatmap, grid_flow)
```
This code runs smoothly on most of the data, but on certain inputs (3 out of about 1500) it raises:
```
Traceback (most recent call last):
  File "..........py", line 401, in <module>
    cam_R = meta['camera_R'].to(dev)
RuntimeError: CUDA error: an illegal memory access was encountered
```
I have already set CUDA_LAUNCH_BLOCKING to "1", but it still does not point to the place where the error actually happens.
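For anyone following along, a minimal sketch of how to set this from inside a script rather than the shell (the variable only takes effect if set before the first CUDA call, so it must come before torch is imported):

```python
# CUDA kernels launch asynchronously, so the Python traceback often points at
# a later, unrelated op (here, meta['camera_R'].to(dev)). CUDA_LAUNCH_BLOCKING=1
# forces synchronous launches so the failing kernel is reported at its real
# call site. Set it before importing torch, or it may be ignored.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Even with blocking launches, some failure modes (like this one, apparently) can still surface at a later line, which matches what is reported above.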
I also tried commenting out new_heatmap = F.grid_sample(heatmap, grid_flow), and then there is no error.
So I am fairly sure the error occurs inside F.grid_sample.
Update 1:
I changed the grid_sample mode to nearest, new_heatmap = F.grid_sample(heatmap, grid_flow, mode='nearest'), and there is no runtime error.
I just cannot determine what causes this runtime error. It might be caused by running out of memory. When I run the code in debug mode and step through each line, GPU memory usage starts to increase and eventually results in:
```
find_frame: frame not found.
Looking for thread_id:pid_20017_id_140663683239608, frame_id:140663360355304
Current thread_id:pid_20017_id_140663683239608, available frames:
94804309553960 - 94804309571304 - 140648434923752 - 140663064649256 - 94804081775976
RuntimeError: CUDA out of memory. Tried to allocate 600.00 MiB (GPU 0; 10.91 GiB total capacity; 9.78 GiB already allocated; 352.44 MiB free; 205.70 MiB cached)
```
But it is still weird that when I run the code normally (not under the debugger), the error happens right in grid_sample().
With the help of torch.cuda.memory_allocated(dev) and torch.cuda.memory_cached(dev), I can see that allocated and cached memory are 2,733,509,632 and 5,054,136,320 bytes just before the call to grid_sample, and then the error RuntimeError: CUDA error: an illegal memory access was encountered occurs.
Again, with mode='nearest' in grid_sample(), no error occurs.
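For anyone debugging similarly, a small helper like this (the name cuda_mem_snapshot is my own) can bracket the suspect call to record allocator state before and after:

```python
import torch

def cuda_mem_snapshot(dev=0):
    # Returns (allocated, cached) bytes for the given device, or None when
    # CUDA is unavailable, so the helper is safe on CPU-only machines.
    if not torch.cuda.is_available():
        return None
    return (torch.cuda.memory_allocated(dev), torch.cuda.memory_cached(dev))

before = cuda_mem_snapshot()
# new_heatmap = F.grid_sample(heatmap, grid_flow)  # the suspect op goes here
after = cuda_mem_snapshot()
print(before, after)
```

Note that in later PyTorch releases memory_cached was renamed to memory_reserved; the old name still works but emits a deprecation warning.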
```
torch.Size([4, 1, 16, 100, 4096])
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=238 error=77 : an illegal memory access was encountered
```
But if I run it without that option ticked, the error message is not shown; it is also not shown in debug mode or with python xxx.py.
I am not sure whether there is no error at all, or it is just not shown.
I think the error you're getting might be related to this issue.
If so, the problem probably occurs whenever your predicted grid happens to contain an infinite or NaN value. Perhaps your model's predictions blow up during training, or you are dividing somewhere by a value that can sometimes be 0?
We will hopefully be able to fix this issue in grid_sample so it handles these values without crashing, but in the meantime, my best recommendation for a workaround is to check for any NaN or infinite values in your grid (perhaps using torch.isfinite) and replace them with some stand-in value before passing the grid to grid_sample.
I am getting this error when I try to access grid values that are out of range, on CUDA but not on CPU:
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)
flow = torch.nn.Tanh()(torch.randn(1, 64, 64, 2))
pred_x = F.grid_sample(x, flow)  # no problem
pred_x = F.grid_sample(x, 2 * flow)  # no problem
pred_x = F.grid_sample(x.to(torch.device('cuda')), flow.to(torch.device('cuda')))  # no problem
pred_x = F.grid_sample(x.to(torch.device('cuda')), 2 * flow.to(torch.device('cuda')))
# RuntimeError: CUDA error: an illegal memory access was encountered
```
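Until the kernel handles this, one defensive option (my own suggestion, not an official fix) is to clamp the predicted grid into the valid [-1, 1] coordinate range before sampling:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)
flow = torch.tanh(torch.randn(1, 64, 64, 2))

# 2 * tanh(...) can leave [-1, 1]; clamping keeps every coordinate inside the
# range grid_sample defines, avoiding the out-of-range CUDA path shown above.
safe_flow = (2 * flow).clamp(-1.0, 1.0)
pred_x = F.grid_sample(x, safe_flow)
print(pred_x.shape)  # torch.Size([1, 3, 64, 64])
```

Note that clamping changes the result at the borders compared to letting padding_mode handle out-of-range coordinates, so it is a stopgap rather than an equivalent computation.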
I cannot reproduce this issue in the latest nightly.
Could you update to the nightly binary (use a new conda environment, if necessary) and rerun the code, please?
Funnily enough, I am not able to reproduce this in stable or nightly any more.
I first encountered this while training a model to predict flow and warp an image as above. At one point, it gave me the "illegal memory access" error, and that error persisted in later commands involving CUDA tensors. Even after I restarted the Python environment, data transfer from CPU to GPU (using .to(torch.device('cuda'))) was extremely slow, and my screen display kept freezing (the same GPU drives the display on my laptop). I have rebooted my laptop, and now things are working fine again. (Just to let the information be recorded.)
Still not sure why the "illegal memory access" error was encountered, though. Do you think it might be because it encountered a NaN or Inf value in flow? Thoughts?
This might be the case, but I'm also unable to reproduce it using NaNs for x and flow.
A NaN value should not cause an illegal memory access; it should rather yield a proper error message. So please let us know if you are able to reproduce this behavior, and we will fix it.