How do I debug CUDNN_STATUS_EXECUTION_FAILED?

I got the following error when running my PyTorch code. I believe it comes from my “gather” call, but the error message does not tell me which line triggers it, so I do not know where to start. Is there any trick for debugging this kind of error?

The weird thing is that the error is intermittent: some runs fail and others do not, which makes the bug hard to reproduce, let alone debug.

  File "./model.py", line 435, in update
    loss.backward()
  File "/home/zewei/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zewei/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
/pytorch/aten/src/THC/THCTensorScatterGather.cu:124: void THCudaTensor_scatterKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [96,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
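
The assertion on the last line says that some index passed to a scatter kernel fell outside the range 0 to tensor.sizes[dim] - 1. As far as I know the backward pass of gather runs a scatter, which would fit the suspicion about “gather”. Below is a minimal sketch of checking the index tensor before the call, so that a bad value raises a readable Python error on the exact line instead of a device-side assertion. The helper name check_index_in_range and the example tensors are made up for illustration; they are not part of PyTorch or of the original model.

```python
import torch


def check_index_in_range(size: int, index: torch.Tensor) -> None:
    """Hypothetical helper: raise if any entry of `index` is outside [0, size - 1]."""
    if index.numel() == 0:
        return  # nothing to check
    lo, hi = int(index.min()), int(index.max())
    if lo < 0 or hi >= size:
        raise IndexError(
            f"index values span [{lo}, {hi}], expected them within [0, {size - 1}]"
        )


# Example data (assumed, CPU only): a 3 x 4 source tensor gathered along dim 1.
src = torch.arange(12.0).reshape(3, 4)
dim = 1

good = torch.tensor([[0, 3], [1, 2], [2, 0]])
check_index_in_range(src.size(dim), good)  # passes silently
picked = src.gather(dim, good)

bad = torch.tensor([[0, 4], [1, 2], [2, 0]])  # 4 is out of range for size 4
try:
    check_index_in_range(src.size(dim), bad)
    caught = False
except IndexError:
    caught = True
```

Calling a check like this right before every gather or scatter in the model costs a device synchronisation, so it is meant as a temporary debugging aid rather than something to leave in the training loop.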

I am encountering the same problem.
Did you solve it?

I ran into this problem when using topk(). Calling topk on a tensor that contains “nan” is more or less undefined behavior; please refer to
github/issues and pytorch/pytorch#1810. Hope this helps.
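
If NaN values reaching topk() are the cause, one way to confirm it is to test for NaN right before the call. This is only a sketch: safe_topk is a hypothetical wrapper name, not a PyTorch function, and the example scores are made up.

```python
import torch


def safe_topk(scores: torch.Tensor, k: int, dim: int = -1):
    """Hypothetical wrapper: refuse to run topk on a tensor that contains NaN."""
    if torch.isnan(scores).any():
        raise ValueError("scores contain NaN; topk indices would be unreliable")
    return scores.topk(k, dim=dim)


# Clean input: topk returns the k largest values and their positions.
values, indices = safe_topk(torch.tensor([0.1, 0.7, 0.3]), k=2)

# Input with a NaN: the wrapper raises before topk is ever called.
try:
    safe_topk(torch.tensor([0.1, float("nan"), 0.3]), k=2)
    rejected = False
except ValueError:
    rejected = True
```

NaN that only shows up for some batches would also fit the error appearing in some runs and not in others.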