RuntimeError: cuda runtime error (59)

kundan2510 · February 27, 2017, 5:12am

I am new to PyTorch and tried to run a seq2seq model. My code runs well for a few batches (around 4 to 5) before I get the following error:
…
RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/soumith/local/builder/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu:76

When I enable cudnn, I get the following error:
…
cudnn.rnn.forward(self, input, hx, weight, output, hy)
File "/u/kumarkun/.local/lib/python2.7/site-packages/torch/backends/cudnn/rnn.py", line 241, in forward
w.zero_()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/soumith/local/builder/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:26

Strangely, the source code line number that causes the error is different for different runs.
Has anyone encountered such an issue before? Is this a known issue and is there an easy way to fix this?

I would greatly appreciate any pointers.

kundan2510 · February 27, 2017, 5:26am

I had an issue with the embedding layer. I had an index which was greater than the vocab-size. Fixing that resolved the issue. I still don’t understand how the error shown is related to that.
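For anyone hitting the same thing: an embedding layer is a lookup table with `num_embeddings` rows, so every index has to lie in `[0, num_embeddings - 1]`. Below is a minimal sketch of a check that can be run on the CPU before a batch is moved to the GPU. The helper `check_embedding_indices`, the value of `vocab_size`, and the sample batches are made up for illustration and are not from my original code.

```python
import torch
import torch.nn as nn

def check_embedding_indices(indices, num_embeddings):
    """Raise a readable error if any index is outside [0, num_embeddings - 1]."""
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= num_embeddings:
        raise ValueError(
            "embedding index out of range: min=%d, max=%d, valid range is [0, %d]"
            % (lo, hi, num_embeddings - 1)
        )

vocab_size = 10  # assumed vocabulary size for this example
embedding = nn.Embedding(vocab_size, 4)

good_batch = torch.tensor([[1, 2, 9]])
bad_batch = torch.tensor([[1, 2, 10]])  # 10 == vocab_size, one past the last valid index

check_embedding_indices(good_batch, embedding.num_embeddings)  # passes silently
vectors = embedding(good_batch)  # one 4-dimensional vector per index

try:
    check_embedding_indices(bad_batch, embedding.num_embeddings)
except ValueError as err:
    message = str(err)
    print(message)
```

The check reads the min and max back to Python, so it is cheapest when done on CPU tensors during data loading rather than on GPU tensors inside the training loop.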

smth · February 27, 2017, 5:07pm

Because of the asynchronous nature of CUDA, the stack trace reported with the assert might not point to where the assert was actually triggered.

If you run the program with CUDA_LAUNCH_BLOCKING=1 python script.py
you will get a more exact stack trace.
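The variable can also be set from inside the script, as long as that happens before CUDA is initialized. A minimal sketch, assuming the assignment sits at the very top of the entry-point script, ahead of `import torch`:

```python
import os

# Has to run before the CUDA context is created; putting it ahead of
# `import torch` is the safe choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and build the model only after this point
```

With blocking launches each kernel call waits for completion, so the Python stack trace should point at the call that triggered the assert. It slows things down, so it is best kept for debugging runs.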

ajbrock · March 2, 2017, 8:09am

So I ran into the same error message for a similar reason just now, and I think it makes sense to add in some error handling to catch it. Feel free to skip to the TLDR if you don’t care for STORY TIME.

STORY TIME!

I recently re-installed my NVIDIA drivers (someone in the lab broke the server trying to install them, and for the life of me I have no idea what went wrong) and managed to compile the latest PyTorch build from source (which previously had thrown errors related to CUDA, suggesting that PyTorch unsurprisingly doesn’t compile if you’re using old drivers). Yay!

Then I went to run a little piece of code that’s basically just GPU-accelerated numpy, and which I had run about a thousand times previously without error, and it starts throwing the device-side assert error 59. The strangest thing about this was that it was throwing the error on an allocation call–the exact line was X=torch.cuda.HalfTensor([5,4,3]).

It turned out the issue was in the line above, where I call index_select on a different tensor using a variable that can vary widely–the issue was that I clamped my indices incorrectly:

out = inp.index_select(2, torch.clamp(index, 0, inp.size()[-1])).

This should obviously be clamped to 0, inp.size()[-1] - 1, since Python is zero-indexed. (Sidenote: switching between Python and MATLAB daily is awful.) Anyhow, there are two interesting things about this. The first is that with the previous drivers (361.45) this never threw any errors, and I was able to incorrectly index_select these tensors without issue (I know for certain that the “index” value was able to go above the size of that axis, so it’s not that I wasn’t hitting that case). The second is that the error only became apparent after the calls to index_select; i.e. I’m able to cause this error, but it won’t show up until I try to do something else, at which point it shows up as a CUDA runtime error.
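For reference, the corrected clamp looks roughly like this. It is a small CPU sketch with made-up shapes and index values, not my actual tensors:

```python
import torch

inp = torch.arange(24.0).reshape(2, 3, 4)  # last axis has length 4
index = torch.tensor([0, 3, 7])            # 7 is out of range for that axis

last = inp.size(-1) - 1                    # largest valid index along the last axis
safe_index = torch.clamp(index, 0, last)   # out-of-range values are pulled back to `last`
out = inp.index_select(2, safe_index)

print(safe_index.tolist())  # [0, 3, 3]
```

Clamping silently maps bad indices onto the last element, so it only makes sense when that is the intended behaviour; otherwise it hides the bug instead of surfacing it.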

TL;DR: I think it makes sense to add some index checking in index_select that says “if you’re trying to select along this axis with an index greater than the length of the axis minus one, raise an error,” or clamp it if there’s some other intended behavior. Otherwise it throws an assert error after the line that actually causes the error.

smth · March 2, 2017, 3:38pm

@ajbrock the fundamental problem with throwing more informative errors is that they are not allowed by CUDA. CUDA can only do device asserts, and won’t even tell you what the triggered assert is. This constrains how user-friendly we can make things.

ajbrock · March 3, 2017, 6:14pm

I think it makes sense not to try and catch all the CUDA-side errors, but would it be possible to institute a check in the index_select function itself (similar to the fixes for the recent slicing issues, maybe) that ensures that users don’t try to grab an index that’s not in the array? I.e. if X.size()=[5,5,5] and you call X.index_select(2,5) (or more egregiously X.index_select(2,7)) it raises an error.

Or is the tensor.index_select function just something that’s hard-baked into CUDA that PyTorch doesn’t actually wrap? I might be misunderstanding how it’s put together.

apaszke · March 3, 2017, 8:21pm

What’s index_select(2, 5)? It expects a Tensor as the second argument.

ajbrock · March 3, 2017, 10:18pm

Sorry, I was just trying to do that as an example of indexing outside the allowable array size. Here’s a better example:

X = torch.randn(5,5,5)
y = torch.LongTensor([5])
X.index_select(2,y)

returns RuntimeError: out of range at /data/users/soumith/miniconda2/conda-bld/pytorch-0.1.7_1485444530918/work/torch/lib/TH/generic/THTensor.c:379

but if I run
X = torch.randn(5,5,5).cuda()
y = torch.LongTensor([5]).cuda()
X.index_select(2,y) # Works, but should not
X.index_select(2,y+5) # Also works, but really should not

On the version I installed off conda this allows me to grab random junk (presumably it’s grabbing things from nearby memory locations) but it doesn’t throw any errors. On the version I compiled yesterday this may or may not throw a device-side assert error (the GPUs are all tied up on that machine so I can’t properly test it atm) but in the case I was previously encountering it was not throwing that assert error until after I had done the bad indexing.
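One way to get a readable message for a case like this is to replay the same call on CPU copies of the tensors, where the bounds check runs synchronously. A minimal sketch follows. The exception type has differed between PyTorch versions (RuntimeError in the output above, IndexError in later releases as far as I know), so both are caught:

```python
import torch

X = torch.randn(5, 5, 5)
y = torch.tensor([5])  # valid indices along dim 2 are 0..4

cpu_error = None
try:
    # If X and y live on the GPU, call X.cpu().index_select(2, y.cpu()) instead.
    X.index_select(2, y)
except (RuntimeError, IndexError) as err:
    cpu_error = err

print(type(cpu_error).__name__, cpu_error)
```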

Is it not possible to insert a value check in the index_select function to ensure that the elements of the second argument are all less than the size of the indexed tensor along that axis?

apaszke · March 3, 2017, 11:18pm

It’s impossible, because we’d have to execute max and min over the whole tensor, synchronize the CPU with the GPU, and execute the index_select kernel only if the values are correct. It would be too slow.

ajbrock · March 3, 2017, 11:23pm

Got it, that makes sense.

11129 · October 21, 2017, 7:37am

This helped me, thanks.

euwern · December 27, 2017, 7:31am

Same here, thanks for the hint… I was converting some code from Torch to PyTorch. Lua uses 1-based indexing but Python uses 0-based indexing. My dataset was created for Lua, hence it was off by one.

gleb · January 6, 2018, 10:30am

I had this error when I created the model with the wrong number of classes:

model=resnet.resnet34(num_classes=2), whereas there were 4 possible classes. Setting num_classes=4 fixed the problem.

miladiouss · June 13, 2018, 6:35am

Similar to gleb, I faced this error when my class labels were wrong.
I had 3 classes, but the class indices were not 0, 1, 2; instead they were 3, 5, 6.
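A check along these lines, run once over the labels on the CPU, would catch this before training. It is a minimal sketch: `check_labels` and the sample list are made up for illustration. The assumption is a loss such as nn.CrossEntropyLoss, which expects class indices in [0, num_classes - 1].

```python
def check_labels(labels, num_classes):
    """Return the sorted distinct labels that fall outside [0, num_classes - 1]."""
    return sorted({int(label) for label in labels
                   if not 0 <= int(label) < num_classes})

labels = [3, 5, 6, 3, 6, 5]     # hypothetical label column with 3 distinct classes
num_classes = len(set(labels))  # 3

bad = check_labels(labels, num_classes)
print(bad)  # [3, 5, 6]: every label is outside the valid range 0..2
```

An empty result means the labels are usable as they are.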

Fengzi · September 12, 2018, 2:18am

I faced a very similar problem: ‘reduce failed to synchronize: device-side assert triggered’. It occurred when I used BCELoss. I finally found the cause: my output was a matrix containing negative numbers, but BCELoss is implemented with log(), which is undefined for negative inputs. I added a sigmoid to constrain the output to between 0 and 1, and my problem was solved.
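A minimal sketch of that fix, with made-up logits and targets. The second option uses nn.BCEWithLogitsLoss, which is not what I used originally but applies the sigmoid internally and should give the same value:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[-2.0, 0.5, 3.0]])  # raw model outputs, may be negative
target = torch.tensor([[0.0, 1.0, 1.0]])

# Option 1: squash the outputs into (0, 1) before passing them to BCELoss.
probs = torch.sigmoid(logits)
loss_a = nn.BCELoss()(probs, target)

# Option 2: pass the raw outputs to BCEWithLogitsLoss, which applies the sigmoid itself.
loss_b = nn.BCEWithLogitsLoss()(logits, target)

print(float(loss_a), float(loss_b))
```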

jdenim · December 5, 2018, 5:25pm

1) How did you check the class indices?
2) How did you change the class indices from 3, 5, 6 to 0, 1, 2?

Thanks

m0nster · February 12, 2019, 3:31pm

I am getting the same error. I have 10 classes in my dataset. I have seen that the labels are in the range 1 to 10 instead of 0 to 9. How do I fix that? @miladiouss @euwern

tom · February 12, 2019, 3:36pm

I’m not sure if I’m missing the obvious, but how about subtracting 1 from the labels?

Best regards

Thomas

m0nster · February 13, 2019, 12:41pm

Yes @tom, it will work if we subtract 1 from the labels, but my question is different.
I did another classification problem, and there the labels started from 0.
Now I am doing a new classification problem, and here the labels start from 1.
Why is this happening, and how can I prevent it?

tom · February 13, 2019, 8:58pm

Well, in data preparation, you would map whatever classes you have to 0 ... number of classes-1, whether that is what you start with or you have words as classes or some otherwise numbered classes. One example of how to do this in Python is given in Jeremy Howard’s fast.ai classes when he does language models.
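As a rough sketch of that mapping in plain Python (the raw label values are made up, and the names `class_to_idx` and `idx_to_class` are just for illustration):

```python
raw_labels = [3, 5, 6, 3, 6, 5]  # hypothetical raw labels; strings work the same way

classes = sorted(set(raw_labels))                       # distinct classes in a fixed order
class_to_idx = {c: i for i, c in enumerate(classes)}    # raw class -> 0..num_classes-1
idx_to_class = {i: c for c, i in class_to_idx.items()}  # inverse, for reading predictions

labels = [class_to_idx[c] for c in raw_labels]
print(labels)  # [0, 1, 2, 0, 2, 1]
```

Building the mapping from the data means it does not matter whether a dataset happens to start at 0, at 1, or uses arbitrary values. Keeping `idx_to_class` around lets you translate predictions back to the original labels.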