So I ran into the same error message for a similar reason just now, and I think it makes sense to add some error handling to catch it. Feel free to skip to the TL;DR if you don’t care for STORYTIME.
I recently re-installed my NVIDIA drivers (someone in the lab broke the server trying to install them, and for the life of me I have no idea what went wrong) and managed to compile the latest PyTorch build from source (which previously had thrown errors related to CUDA, suggesting that PyTorch unsurprisingly doesn’t compile if you’re using old drivers). Yay!
Then I went to run a little piece of code that’s basically just GPU-accelerated numpy, which I had run about a thousand times previously without error, and it started throwing the device-side assert error 59. The strangest thing about this was that it was throwing the error on an allocation call; the exact line was
It turned out the issue was in the line above, where I call index_select on a different tensor using an index whose values can vary widely. I had clamped my indices incorrectly:
out = inp.index_select(2, torch.clamp(index, 0, inp.size()[-1])).
The upper bound should obviously be inp.size()[-1] - 1 rather than inp.size()[-1], since Python is zero-indexed. (Sidenote: switching between Python and MATLAB daily is awful.)
Anyhow, there are two interesting things about this. The first is that with the previous drivers (361.45) this never threw any errors, and I was able to incorrectly index_select these tensors without issue (I know for certain that the “index” value was able to go above the size of that axis, so it’s not that I wasn’t hitting that case). The second is that the error only became apparent after the call to index_select; i.e. I’m able to cause this error, but it won’t show up until I try to do something else, at which point it shows up as a CUDA runtime error.
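For anyone who wants to see the off-by-one concretely, here is a minimal sketch of the corrected clamp. The tensor shapes and index values are made up for illustration (they are not from my actual code), and it runs on the CPU so it does not need a GPU to try out.

```python
import torch

# Made-up example data: the last axis of `inp` has length 4,
# so the only valid positions along it are 0, 1, 2 and 3.
inp = torch.arange(24.0).reshape(2, 3, 4)
index = torch.tensor([-2, 0, 3, 4, 9])  # deliberately includes out-of-range values

# Clamp to [0, n - 1], not [0, n]: position n is one past the end.
last = inp.size(-1) - 1
safe_index = torch.clamp(index, 0, last)

out = inp.index_select(2, safe_index)
print(safe_index.tolist())
print(out.shape)
```

As far as I understand it, the delayed error comes from CUDA kernels being launched asynchronously, so a failed device-side assert only gets reported at some later call. Running with the environment variable CUDA_LAUNCH_BLOCKING=1 should make the error show up at the offending line, at the cost of speed.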
TL;DR: I think it makes sense to add some index checking in index_select that says “if you’re trying to select an index along this dimension which is greater than the size of that dimension minus one, raise an error,” or clamp it if there’s some other intended behavior. Otherwise the assert error is thrown after the line that actually causes it.
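Something like the check below is roughly what I have in mind, written as a user-side wrapper. To be clear, checked_index_select is a hypothetical helper name of my own, not an existing PyTorch function, and this is only a sketch of the idea rather than how it would be done inside the library.

```python
import torch


def checked_index_select(inp, dim, index):
    """Hypothetical wrapper: validate `index` before calling index_select."""
    n = inp.size(dim)
    if index.numel() > 0:
        # Converting to Python ints forces a device sync when `index` lives
        # on the GPU, so this check trades some speed for a clear error.
        lo, hi = int(index.min()), int(index.max())
        if lo < 0 or hi >= n:
            raise IndexError(
                f"index values must lie in [0, {n - 1}] for dim {dim}, "
                f"but found values in [{lo}, {hi}]"
            )
    return inp.index_select(dim, index)
```

With this, an index equal to the size of the axis raises an IndexError right at the call, instead of a device-side assert somewhere further down.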