Cublas runtime error

I get this cublas error after several epochs of training. The data is fed sequentially and the error always occurs at a different iteration, so it is difficult to reproduce in a simple example. Something is causing an assertion on line 97 of https://github.com/torch/cutorch/blob/master/lib/THC/THCTensorScatterGather.cu. Any ideas of what could be happening, or what I should be checking for? I’ve tried just catching it and continuing onto the next iteration, but then there are just more (different) cublas errors on the next iteration, too.

The error always occurs during a forward call of a torch.nn.Linear object

RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/torch/lib/THC/THCBlas.cu:246
/pytorch/torch/lib/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [29,0,0] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.

Thanks

What does your architecture look like?

It’s a custom RNN decoder for image captioning, described in this paper: https://arxiv.org/abs/1707.07998. I suppose I could share the full source code, though I really wish I could provide a minimal working example.

Also if I do catch this exception and move onto the next iteration, I get this CUDA runtime error just trying to even move a tensor to the GPU for initialization.

File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/models/networks.py”, line 106, in init_state
state_attention = (hidden_attention_lstm.cuda(), cell_attention_lstm.cuda())
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/autograd/variable.py”, line 298, in cuda
return CudaTransfer.apply(self, device, async)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/autograd/_functions/tensor.py”, line 201, in forward
return i.cuda(async=async)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/_utils.py”, line 69, in cuda
return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCTensorCopy.c:20
/pytorch/torch/lib/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [30,0,0] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.

The cublas error is a red herring. Your true problem is an assert thrown by the gather kernel. Is it possible that something is wrong with the indices that are sent to it? After the assert is thrown, the CUDA context is corrupted, so don’t try catching it and continuing; the errors that you see afterwards are expected. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#assertion

@ngimel How can I check for that? I’ve asserted that all the tensors being fed to this torch.nn.Linear layer are the correct dimension and don’t contain nan’s or inf values. Not sure what else could be corrupted.

The problematic layer is not torch.nn.Linear, it’s something like index_select, or some advanced indexing that you are possibly using, hard to say without looking at your model.

Okay. Just to clarify, even though the traceback (below) shows the error at torch.nn.Linear and I’ve checked the inputs to this layer, the issue could be caused by advanced indexing happening somewhere else? Doesn’t exactly make sense to me, but I can certainly go through and check the indexing in the rest of my code. Thanks

Traceback (most recent call last):
File “example_server.py”, line 74, in <module>
validate_every=100, validater_examples=validater_examples, validate_examples_every=1000, use_cuda=True)
File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/train.py”, line 61, in train
_, logprobs = decoder.forward(V, y_0, state_attention, state_language, y_true=caption_true_packed)
File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/models/networks.py”, line 78, in forward
(state_language[0][0:batch_size_t], state_language[1][0:batch_size_t]))
File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/models/networks.py”, line 158, in forward
v_hat = self.attention.forward(V, new_state_attention[0]) # N x D
File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/models/networks.py”, line 191, in forward
V_hidden = [self.image_linear(v) for v in V] # List of length N, containing 1 x k x H
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/nn/modules/module.py”, line 325, in __call__
result = self.forward(*input, **kwargs)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/nn/modules/linear.py”, line 55, in forward
return F.linear(input, self.weight, self.bias)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/nn/functional.py”, line 837, in linear
output = input.matmul(weight.t())
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/autograd/variable.py”, line 386, in matmul
return torch.matmul(self, other)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/functional.py”, line 191, in matmul
output = torch.mm(tensor1, tensor2)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/torch/lib/THC/THCBlas.cu:246
/pytorch/torch/lib/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [29,0,0] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.

The gather kernel assertion (THCTensorScatterGather.cu:97) is the part of the stack trace that corrupted your CUDA context; all subsequent errors stem from it.

Got it. I’m trying to understand where in my code there are corrupted indices, but it seems like all we know from this traceback is that it’s due to something that invokes the gather kernel. I’ll start by checking all the advanced indexing. Thanks for your help.

Try adding torch.cuda.synchronize after each cuda operation. That way the error won’t be deferred to later operations.
http://pytorch.org/docs/master/cuda.html#torch.cuda.synchronize
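
As a rough sketch of that suggestion: the wrapper below runs an operation and then calls torch.cuda.synchronize() so that a device-side assert is reported at the call that caused it rather than at some later, unrelated call. The helper name `checked` and the example tensors are hypothetical, not taken from the code in this thread, and the sketch falls back to the CPU when no GPU is present.

```python
import torch


def checked(op, *args, **kwargs):
    """Run `op`, then wait for all queued GPU work to finish.

    `checked` is a hypothetical helper name. If a kernel launched by `op`
    hits a device-side assert, the error is raised here instead of at a
    later, unrelated CUDA call.
    """
    out = op(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 3, device=device)
w = torch.randn(3, 2, device=device)

# Wrap the suspicious calls one at a time while debugging.
y = checked(torch.mm, x, w)
```

Synchronizing after every operation slows training down considerably, so this is probably best kept as a temporary debugging aid.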

If you can share your script, I can take a look.

an easy way to get the correct stack trace with CUDA is to run:

CUDA_LAUNCH_BLOCKING=1 python yourscript.py

This puts CUDA in synchronous mode (rather than its default asynchronous mode).
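
If it is more convenient to launch the run from Python than from the shell, something like the following should be equivalent. It is only a sketch: the helper names `blocking_env` and `run_blocking` are made up for illustration, and `yourscript.py` is the same placeholder as in the command above. The variable is set in the child process's environment, so it is in place before CUDA is initialized there.

```python
import os
import subprocess
import sys


def blocking_env():
    """Return a copy of the current environment with synchronous CUDA launches."""
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script, *script_args):
    """Run `script` with the current interpreter and CUDA_LAUNCH_BLOCKING=1."""
    cmd = [sys.executable, script, *script_args]
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example (placeholder script name):
# run_blocking("yourscript.py")
```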

Ran a couple instances of training yesterday with CUDA_LAUNCH_BLOCKING=1, and was able to debug the issue based on the better stack trace. Long story short, it ended up being an issue with torch.gather very rarely receiving an invalid index. Thanks everyone for your help.
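
For anyone hitting the same assertion, a bounds check in front of torch.gather might look roughly like this. It is a minimal sketch, not the fix actually used in this thread: `safe_gather` is a hypothetical name, and the check mirrors the condition in the kernel assertion (`indexValue >= 0 && indexValue < src.sizes[dim]`).

```python
import torch


def safe_gather(src, dim, index):
    """Call torch.gather only after checking every index is in range.

    `safe_gather` is a hypothetical helper. It raises a Python IndexError
    before the kernel is launched, so the CUDA context is never corrupted.
    """
    size = src.size(dim)
    bad = (index < 0) | (index >= size)
    if bad.any():
        raise IndexError(
            "gather index out of range for dim %d (size %d): %s"
            % (dim, size, index[bad].tolist())
        )
    return torch.gather(src, dim, index)


src = torch.arange(12.0).view(3, 4)
good = torch.tensor([[0], [3], [1]])
picked = safe_gather(src, 1, good)  # picks 0., 7., 9.

bad_index = torch.tensor([[0], [4], [1]])  # 4 is out of range for size 4
try:
    safe_gather(src, 1, bad_index)
except IndexError as err:
    print(err)
```

On a GPU tensor the `bad.any()` check forces a synchronization, so it costs some speed and may be worth removing once the source of the bad index is found.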

Did the trick for me!

@Rafael_Valle Erm, what exactly did you do?

called torch.cuda.synchronize()

where did you put it?

Take a look at SimonW’s answer:
“Try adding torch.cuda.synchronize after each cuda operation.”

Yes, torch.gather is very efficient, but we need to be careful when using it. Thanks!