How to debug this backward error?

I am getting the error below. Is it caused by some operation that is not differentiable, or by an in-place operation? I would like to know how to debug this error.
Traceback (most recent call last):
  File "training/train_super_pixel_net.py", line 370, in <module>
    main(args)
  File "training/train_super_pixel_net.py", line 103, in main
    avg_loss = train(args, model, optimizer, epoch, trainloader, scheduler, best_iou, loss_fn)
  File "training/train_super_pixel_net.py", line 151, in train
    loss.backward()
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 30, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 41, in forward
    return comm.reduce_add_coalesced(grads, destination)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 119, in reduce_add_coalesced
    flat_tensors = [_flatten_dense_tensors(chunk) for chunk in chunks]
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 119, in <listcomp>
    flat_tensors = [_flatten_dense_tensors(chunk) for chunk in chunks]
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 144, in _flatten_dense_tensors
    flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:271

This looks like a bug in the CUDA code: https://github.com/NVIDIA/vid2vid/issues/19
I bet that if you run in CPU mode this will work fine.
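As a quick sanity check, you could run a single training step on the CPU, where errors are raised synchronously with a readable Python stack trace. A minimal sketch (model, loss_fn, inputs and targets are placeholders for your own objects):

```python
# Reproduce one training step on the CPU to get a readable error location.
model_cpu = model.cpu()

out = model_cpu(inputs.cpu())
loss = loss_fn(out, targets.cpu())
loss.backward()  # an indexing/shape bug will now fail here with a clear traceback
```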


I would probably first try to establish the exact location of the failure. Due to the asynchronous nature of GPU computing, the error is usually reported later than where it actually occurs. A while ago, I wrote a quick howto for this. There are some operations that may cause errors like this when used with invalid data, e.g. indexing functions when you request an index that is outside the array.
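For illustration, here is a minimal sketch (not from your code; the tensors are made up) of the out-of-range indexing case, and of forcing a synchronization right after a suspect operation so the error surfaces there instead of at a later, unrelated call:

```python
import torch

x = torch.randn(10, 4, device="cuda")
idx = torch.tensor([3, 11], device="cuda")  # 11 is out of bounds for the size-10 dim 0

# The kernel launch below returns immediately; without synchronization the
# resulting device-side error would only be reported by some later CUDA call
# (e.g. inside loss.backward(), as in the traceback above).
y = x[idx]

# Synchronizing right after the suspect op makes the error show up here,
# which lets you bisect the real location.
torch.cuda.synchronize()
```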

Best regards

Thomas


Hi! Thanks to you all. It was a problem in the CUDA code; it took me a long time to find the offending location in the code.

I met a similar issue.
Could you tell me what the cause turned out to be?
Thanks a lot.

Hi! I have added os.environ['CUDA_LAUNCH_BLOCKING'] = "1" to my code, but it seems that the traceback is still random and not useful. Do you have any advice on debugging this kind of error?
Thanks for any suggestions!

Did you add it at the very top, though?
It needs to be there when PyTorch is first imported.
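For example (a sketch, not your actual file), the very first lines of the training script would look like this:

```python
# Set the variable before anything that imports torch and initializes CUDA.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # only now import PyTorch

# ... the rest of the imports and the training code follow here
```

Alternatively, you can set the variable in the shell that launches the script, so the ordering inside the file does not matter.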

Best regards

Thomas

Yes, I added it at the very top.

So this worked for me: basically, I downgraded from Python 3.7 to 3.6.9 and installed PyTorch 1.0.0 using pip.

Here’s my config,

  1. Ubuntu 18.04 LTS
  2. CUDA 10.2
  3. CuDNN 7.6.1
  4. Nvidia TITAN RTX x4
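If it helps anyone compare setups, the installed versions can be confirmed from Python itself (a small check, not part of the original post):

```python
import torch

print(torch.__version__)               # PyTorch version, e.g. 1.0.0
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.device_count())       # number of visible GPUs
print(torch.cuda.get_device_name(0))   # GPU name, e.g. TITAN RTX
```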