So I have some long code, but I'll give just the snippet that causes the problem. I have built a DQN learning agent and I am training it on a Tesla V100 GPU.
This is the line of code where the problem occurs:
q_eval = self.Q_eval.forward(state_batch, state_seq_batch, tensor_batch_index)
q_eval = q_eval[batch_index, action_batch]
and the error message is:
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [70,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [71,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [72,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [73,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [74,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "main.py", line 268, in
agent.learn()
File "/zhome/8b/b/126923/Scripts/Python/Bachelor_Project/agents.py", line 208, in learn
q_eval = q_eval[batch_index, action_batch]
RuntimeError: CUDA error: device-side assert triggered
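For context, the second indexing line is per-row gathering: for each row i it picks column action_batch[i] out of the Q-values. Here is a tiny NumPy sketch of the same pattern (toy shapes and hypothetical values, not my real data), which on the CPU would raise an IndexError immediately for a bad index instead of the asynchronous device-side assert:

```python
import numpy as np

# Toy Q-value table: 4 samples, 5 actions (hypothetical sizes).
q = np.arange(20, dtype=np.float32).reshape(4, 5)

batch_index = np.arange(4)             # one row index per sample
action_batch = np.array([1, 0, 4, 2])  # chosen action per sample

# Advanced indexing: picks q[i, action_batch[i]] for every i.
picked = q[batch_index, action_batch]
print(picked)  # [ 1.  5. 14. 17.]
```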
I wondered whether it had something to do with the indexing, so I printed out the shape of q_eval and the values of batch_index and action_batch:
q_eval shape:: torch.Size([32, 200])
batch_index = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31]
action_batch = [ 19. 48. 67. 19. 30. 11. 18. 16. 67. 69. 7. 1. 26. 19.
43. 38. 61. 34. 21. 0. 11. 41. 11. 20. 42. 8. 17. 45.
24. 18. 43. 110.]
I don't see any values where the indexing goes "out of bounds"…
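To be sure I wasn't misreading the printout, I ran the printed values through a quick bounds check against the [32, 200] shape in plain Python (same numbers as above):

```python
# Values copied from the printout above.
n_rows, n_cols = 32, 200
batch_index = list(range(32))
action_batch = [19., 48., 67., 19., 30., 11., 18., 16., 67., 69., 7., 1.,
                26., 19., 43., 38., 61., 34., 21., 0., 11., 41., 11., 20.,
                42., 8., 17., 45., 24., 18., 43., 110.]

# PyTorch/NumPy accept indices in [-size, size), so check that range.
row_ok = all(-n_rows <= i < n_rows for i in batch_index)
col_ok = all(-n_cols <= a < n_cols for a in action_batch)
print(row_ok, col_ok)  # True True -- every printed value is in bounds
```

So for this particular printout everything is in bounds, which makes me suspect the batch that actually crashes is not the one I managed to print.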
And the worst part about all of this is that it occurs only after about an hour (±5 minutes) of training, meaning I have to wait quite some time to see whether the problem has been resolved. Another note: I cannot reproduce the indexing error on the CPU.
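Since the failure only shows up after an hour, one thing I am considering is guarding the indexing so the offending batch is reported the moment it appears, instead of as a delayed device-side assert. This is just a sketch (the helper name is mine, and the numbers below are made up for illustration); I have also read that launching with the environment variable CUDA_LAUNCH_BLOCKING=1 makes CUDA ops run synchronously, so the traceback points at the real failing line:

```python
def check_gather_indices(indices, dim_size, name):
    """Raise with the offending positions/values if any index
    falls outside the valid range [-dim_size, dim_size)."""
    bad = [(pos, idx) for pos, idx in enumerate(indices)
           if not (-dim_size <= idx < dim_size)]
    if bad:
        raise IndexError(
            f"{name}: out-of-bounds entries (position, value): {bad}")

# Example: an action index equal to the number of actions is out of bounds.
try:
    check_gather_indices([3, 0, 200], 200, "action_batch")
except IndexError as e:
    print(e)  # action_batch: out-of-bounds entries (position, value): [(2, 200)]
```

In the training loop I would call it on action_batch (converted to a Python list) with the number of actions just before the q_eval indexing, so the bad batch is logged right away.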