How to debug this backward error?

I am getting the error below. Is it caused by some operation that is not differentiable, or by an in-place operation? I would like to know how to debug this error.
Traceback (most recent call last):
  File "training/train_super_pixel_net.py", line 370, in <module>
    main(args)
  File "training/train_super_pixel_net.py", line 103, in main
    avg_loss = train(args, model, optimizer, epoch, trainloader, scheduler, best_iou, loss_fn)
  File "training/train_super_pixel_net.py", line 151, in train
    loss.backward()
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 30, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 41, in forward
    return comm.reduce_add_coalesced(grads, destination)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 119, in reduce_add_coalesced
    flat_tensors = [_flatten_dense_tensors(chunk) for chunk in chunks]
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 119, in <listcomp>
    flat_tensors = [_flatten_dense_tensors(chunk) for chunk in chunks]
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 144, in _flatten_dense_tensors
    flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:271

This looks like a bug in the CUDA code: https://github.com/NVIDIA/vid2vid/issues/19
I bet that if you run in CPU mode this will work fine.
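As a quick sanity check, you could run a single training step on the CPU, where errors are raised synchronously with a readable Python stack trace. A minimal sketch (model, loss_fn, inputs and targets are placeholders for your own objects):

```python
# Reproduce one training step on the CPU to get a readable error location.
model_cpu = model.cpu()

out = model_cpu(inputs.cpu())
loss = loss_fn(out, targets.cpu())
loss.backward()  # an indexing/shape bug will now fail here with a clear traceback
```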


I would probably first try to establish the exact location of the failure. Due to the asynchronous nature of GPU computing, the error is usually reported later than where it actually occurs. A while ago, I wrote a quick howto for this. There are some operations that may cause errors like this when used with invalid data, e.g. indexing functions when you request an index that is outside the array.
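For illustration, here is a minimal sketch (not from your code; the tensors are made up) of the out-of-range indexing case, and of forcing a synchronization right after a suspect operation so the error surfaces there instead of at a later, unrelated call:

```python
import torch

x = torch.randn(10, 4, device="cuda")
idx = torch.tensor([3, 11], device="cuda")  # 11 is out of bounds for the size-10 dim 0

# The kernel launch below returns immediately; without synchronization the
# resulting device-side error would only be reported by some later CUDA call
# (e.g. inside loss.backward(), as in the traceback above).
y = x[idx]

# Synchronizing right after the suspect op makes the error show up here,
# which lets you bisect the real location.
torch.cuda.synchronize()
```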

Best regards

Thomas


Hi! Thanks to you all. It was a problem in the CUDA code; it took me a long time to find the offending location in the code.

I met a similar issue.
Could you tell me what the cause turned out to be?
Thanks a lot.

Hi! I have added os.environ['CUDA_LAUNCH_BLOCKING'] = "1" to my code, but it seems that the traceback is still random and not useful. Do you have any advice on debugging this kind of error?
Thanks for any suggestions!

Did you add it at the very top, though?
It needs to be there when PyTorch is first imported.
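For example (a sketch, not your actual file), the very first lines of the training script would look like this:

```python
# Set the variable before anything that imports torch and initializes CUDA.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # only now import PyTorch

# ... the rest of the imports and the training code follow here
```

Alternatively, you can set the variable in the shell that launches the script, so the ordering inside the file does not matter.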

Best regards

Thomas

Yes, I added it at the very top.

So this worked for me: basically, I downgraded from Python 3.7 to 3.6.9 and installed PyTorch 1.0.0 using pip.

Here’s my config,

  1. Ubuntu 18.04 LTS
  2. CUDA 10.2
  3. CuDNN 7.6.1
  4. Nvidia TITAN RTX x4
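If it helps anyone compare setups, the installed versions can be confirmed from Python itself (a small check, not part of the original post):

```python
import torch

print(torch.__version__)               # PyTorch version, e.g. 1.0.0
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.device_count())       # number of visible GPUs
print(torch.cuda.get_device_name(0))   # GPU name, e.g. TITAN RTX
```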