RuntimeError: CUDA error: device-side assert triggered - "Index out of bounds" failed

EdinMahmutovic · July 3, 2020, 9:21am

My code is long, so I’ll only show the snippet that causes problems. I have built a DQN learning agent and am training it on a Tesla V100 GPU.
This is the line of code where the problem occurs:

q_eval = self.Q_eval.forward(state_batch, state_seq_batch, tensor_batch_index)
q_eval = q_eval[batch_index, action_batch]

and the error message is:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [70,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [71,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [72,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [73,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [74,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 268, in <module>
    agent.learn()
  File "/zhome/8b/b/126923/Scripts/Python/Bachelor_Project/agents.py", line 208, in learn
    q_eval = q_eval[batch_index, action_batch]
RuntimeError: CUDA error: device-side assert triggered

I wondered whether it had something to do with the indexing, so I printed the shape of q_eval and the values of batch_index and action_batch:

q_eval shape:: torch.Size([32, 200])
batch_index = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31]
action_batch = [ 19. 48. 67. 19. 30. 11. 18. 16. 67. 69. 7. 1. 26. 19.
43. 38. 61. 34. 21. 0. 11. 41. 11. 20. 42. 8. 17. 45.
24. 18. 43. 110.]

I don’t see any values where the indexing goes “out of bounds”…
And the worst part is that the error occurs after 1 hour +/- 5 minutes of training, meaning I have to wait quite some time to see whether the problem has been resolved. Another note: I cannot reproduce the indexing error on a CPU.

ptrblck · July 4, 2020, 3:11am

Did you print the problematic batch, which creates the error or just a random one?
In the latter case you could add an assert statement checking for the invalid indices and print them, in case it’s triggered. This could narrow down the faulty batch and you could dig into the code to figure out, how this invalid index is created.
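
A minimal sketch of such a check, assuming the indices are available as a 1-D tensor or array. The helper name check_indices is hypothetical and not part of PyTorch; it mirrors the bounds condition from the assertion message (index >= -size and index < size) and reports the offending values and their positions:

```python
import torch


def check_indices(index, size, name="index"):
    """Raise IndexError listing every value outside [-size, size).

    Hypothetical helper for debugging; not part of PyTorch.
    """
    index = torch.as_tensor(index).flatten()
    bad = (index < -size) | (index >= size)
    if bad.any():  # converting to bool forces a GPU sync for CUDA tensors
        positions = bad.nonzero(as_tuple=True)[0].tolist()
        raise IndexError(
            f"{name}: invalid values {index[bad].tolist()} "
            f"at positions {positions} for a dimension of size {size}"
        )


# Example: 200 is out of bounds for a dimension of size 200.
try:
    check_indices(torch.tensor([19, 48, 200]), 200, name="action_batch")
except IndexError as err:
    print(err)
```

Called right before q_eval[batch_index, action_batch], once for each index tensor, this raises a Python exception at the faulty batch instead of a device-side assert. The check synchronizes with the GPU, so it is best used only while debugging.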

EdinMahmutovic · July 4, 2020, 7:04am

That is the problematic batch. I printed out every batch to see whether anything changes right before the problematic batch or whether anything unusual happens. But thanks for the tip!

But nothing unusual is happening. q_eval is always 32 x 200, batch_index is always the range 0 to 31, and action_batch is always between 0 and 199. So I have no idea why the error occurs.

And to comment on your idea: printing only the faulty batch doesn’t really speed up debugging. The error still occurs after around 1 hour, which is what bothers me. Nothing in the code changes after 1 hour, which leaves this question open…

ptrblck · July 4, 2020, 7:37am

If that’s the problematic batch, I assume you can reproduce the error by feeding these values to the mentioned operations?
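
A standalone reproduction attempt could look like the sketch below. It rests on some assumptions: q_eval is replaced by random values of the reported shape [32, 200], and the action values, which were printed as floats, are written as integers because PyTorch requires integer (or boolean) index tensors:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for the network output; only the shape matters for indexing.
q_eval = torch.randn(32, 200, device=device)
batch_index = torch.arange(32, device=device)

# Values copied from the printed batch, as integers (assumption, see above).
action_batch = torch.tensor(
    [19, 48, 67, 19, 30, 11, 18, 16, 67, 69, 7, 1, 26, 19,
     43, 38, 61, 34, 21, 0, 11, 41, 11, 20, 42, 8, 17, 45,
     24, 18, 43, 110],
    device=device,
)

# One Q-value per sample: row i, column action_batch[i].
selected = q_eval[batch_index, action_batch]
print(selected.shape)
```

If this runs cleanly, the printed batch is probably not the one that triggers the assert, or the failing operation is elsewhere.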

EdinMahmutovic · July 4, 2020, 9:32am

That sounded like a great idea.

I fixed the action batch to the array given in the description, but that does not reproduce the error immediately… It still fails after 1 hour of training, i.e., there’s no problem in the first 50k iterations… Can the error be something other than an indexing error, even though the error message says otherwise?

ptrblck · July 4, 2020, 10:39pm

No, the error message would give you the failing operation.
However, the stack trace might point to the wrong line of code, due to the asynchronous behavior.
You could rerun the code with:

CUDA_LAUNCH_BLOCKING=1 python script.py args

to get the proper stack trace with the offending operation.
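
If changing the launch command is inconvenient (for example on a cluster scheduler), the same variable can be set from inside the script. This is a sketch under the assumption that the lines run at the very top of the entry script, before torch is imported and before any CUDA work starts, since the setting is read when CUDA is initialized:

```python
import os

# Make kernel launches synchronous so the Python stack trace points at the
# operation that actually failed. Set this before importing torch.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down, so the variable is best removed again once the offending line has been found.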

EdinMahmutovic · July 5, 2020, 2:42pm

Thank you! I didn’t know that command existed. It pointed to a different line of code, where I quickly found the error.

winston-wen · April 29, 2021, 6:26am

Your advice helped me find the exact location of the error, which was really life-saving!
In my case, I created a boolean indexer like the one in the following snippet, and I forgot to handle the case where the indexer is all False!

valid = distmat <= 10
tp = pred[valid, :]
blabla
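
One way to guard against an all-False indexer is sketched below. The function name select_valid, the threshold parameter, and the 1-D shape of distmat (one distance per row of pred) are assumptions for illustration; the post does not say which later operation failed on the empty selection:

```python
import torch


def select_valid(pred, distmat, threshold=10.0):
    """Return the rows of pred whose distance is within threshold.

    Returns None when no row qualifies, so the caller can skip the batch
    instead of passing an empty tensor to later operations.
    """
    valid = distmat <= threshold
    if not valid.any():
        return None
    return pred[valid, :]


pred = torch.arange(6.0).reshape(3, 2)
print(select_valid(pred, torch.tensor([5.0, 60.0, 10.0])))   # rows 0 and 2
print(select_valid(pred, torch.tensor([50.0, 60.0, 70.0])))  # None
```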