Cublas runtime error

mbarnes1 · December 17, 2017, 2:37am

I get this cublas error after several epochs of training. The data is fed sequentially and the error always occurs at a different iteration, so it is difficult to reproduce in a simple example. Something is causing an assertion on line 97 of https://github.com/torch/cutorch/blob/master/lib/THC/THCTensorScatterGather.cu. Any ideas of what could be happening, or what I should be checking for? I’ve tried just catching it and continuing onto the next iteration, but then there are just more (different) cublas errors on the next iteration, too.

The error always occurs during a forward call of a torch.nn.Linear object

RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/torch/lib/THC/THCBlas.cu:246
/pytorch/torch/lib/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [29,0,0] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.

Thanks

SimonW · December 17, 2017, 4:26am

What does your architecture look like?

mbarnes1 · December 17, 2017, 4:50am

It’s a custom RNN decoder for image captioning, described in this paper: https://arxiv.org/abs/1707.07998. I suppose I could share the full source code, though I really wish I could provide a minimal working example.

Also, if I do catch this exception and move on to the next iteration, I get this CUDA runtime error just from trying to move a tensor to the GPU for initialization.

File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/models/networks.py”, line 106, in init_state
state_attention = (hidden_attention_lstm.cuda(), cell_attention_lstm.cuda())
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/autograd/variable.py”, line 298, in cuda
return CudaTransfer.apply(self, device, async)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/autograd/_functions/tensor.py”, line 201, in forward
return i.cuda(async=async)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/_utils.py”, line 69, in cuda
return new_type(self.size()).copy(self, async)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCTensorCopy.c:20
/pytorch/torch/lib/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [30,0,0] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.

ngimel · December 17, 2017, 7:30am

The cublas error is a red herring. Your true problem is an assert thrown by the gather kernel. Is it possible that something is wrong with the indices that are sent to it? After the assert is thrown, the CUDA context is corrupted, so don’t try catching it and continuing; the errors that you see afterwards are expected. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#assertion
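
One way to act on the question about the indices is to validate them on the host before they reach the kernel. The following is only a minimal sketch: `checked_gather` is a hypothetical helper, not a PyTorch function, and it assumes the suspect call is a plain `torch.gather`.

```python
import torch


def checked_gather(src, dim, index):
    """Validate `index` on the host, then call torch.gather.

    Hypothetical debugging helper, not part of PyTorch.
    """
    size = src.size(dim)
    if index.numel() > 0:
        # int(...) copies the value to the host, so a bad index becomes
        # an ordinary Python exception instead of a device-side assert.
        lo, hi = int(index.min()), int(index.max())
        if lo < 0 or hi >= size:
            raise IndexError(
                "gather index out of range: min=%d, max=%d, valid range is [0, %d)"
                % (lo, hi, size)
            )
    return torch.gather(src, dim, index)


src = torch.arange(12.0).reshape(3, 4)
good = torch.tensor([[0], [3], [1]])
picked = checked_gather(src, 1, good)  # one value per row
```

The min/max reads force a device synchronization when the tensors live on the GPU, so a check like this is probably best kept for debugging runs rather than left in the training loop.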

mbarnes1 · December 18, 2017, 4:39pm

@ngimel How can I check for that? I’ve asserted that all the tensors being fed to this torch.nn.Linear layer have the correct dimensions and don’t contain NaN or inf values. Not sure what else could be corrupted.

ngimel · December 18, 2017, 4:53pm

The problematic layer is not torch.nn.Linear, it’s something like index_select, or some advanced indexing that you are possibly using, hard to say without looking at your model.

mbarnes1 · December 18, 2017, 5:02pm

Okay. Just to clarify, even though the traceback (below) shows the error at torch.nn.Linear and I’ve checked the inputs to this layer, the issue could be caused by advanced indexing happening somewhere else? Doesn’t exactly make sense to me, but I can certainly go through and check the indexing in the rest of my code. Thanks

Traceback (most recent call last):
File “example_server.py”, line 74, in <module>
validate_every=100, validater_examples=validater_examples, validate_examples_every=1000, use_cuda=True)
File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/train.py”, line 61, in train
_, logprobs = decoder.forward(V, y_0, state_attention, state_language, y_true=caption_true_packed)
File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/models/networks.py”, line 78, in forward
(state_language[0][0:batch_size_t], state_language[1][0:batch_size_t]))
File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/models/networks.py”, line 158, in forward
v_hat = self.attention.forward(V, new_state_attention[0]) # N x D
File “/zfsauton/home/mbarnes1/aggrevated/aggrevated-coco/models/networks.py”, line 191, in forward
V_hidden = [self.image_linear(v) for v in V] # List of length N, containing 1 x k x H
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/nn/modules/module.py”, line 325, in __call__
result = self.forward(*input, **kwargs)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/nn/modules/linear.py”, line 55, in forward
return F.linear(input, self.weight, self.bias)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/nn/functional.py”, line 837, in linear
output = input.matmul(weight.t())
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/autograd/variable.py”, line 386, in matmul
return torch.matmul(self, other)
File “/zfsauton/home/mbarnes1/miniconda2/envs/torch/lib/python2.7/site-packages/torch/functional.py”, line 191, in matmul
output = torch.mm(tensor1, tensor2)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/torch/lib/THC/THCBlas.cu:246
/pytorch/torch/lib/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [29,0,0] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.

ngimel · December 18, 2017, 5:06pm

The gather kernel assertion on the last line of your stack trace (THCTensorScatterGather.cu:97) is the part that corrupted your CUDA context; all subsequent errors stem from it.

mbarnes1 · December 18, 2017, 5:21pm

Got it. I’m trying to understand where in my code there are corrupted indices, but it seems like all we know from this traceback is that it’s due to something that invokes the gather kernel. I’ll start by checking all the advanced indexing. Thanks for your help.
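
When checking a suspect indexing call, it can help to replay it on CPU copies of the tensors, since an out-of-range index there raises an ordinary Python exception at the call site instead of a deferred device-side assert. A rough sketch, assuming the suspect operation is `torch.gather`; `try_gather_on_cpu` is a made-up name for illustration:

```python
import torch


def try_gather_on_cpu(src, dim, index):
    """Run torch.gather on CPU copies; return the error text, or None on success.

    Hypothetical debugging helper, not part of PyTorch.
    """
    try:
        torch.gather(src.cpu(), dim, index.cpu())
    except (RuntimeError, IndexError) as err:
        return str(err)
    return None


src = torch.zeros(2, 3)
bad_index = torch.tensor([[0], [5]])  # 5 is out of range for a dimension of size 3
message = try_gather_on_cpu(src, 1, bad_index)
```

Here `message` holds the text of the exception raised on the CPU, while a valid index tensor would give `None`.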

SimonW · December 18, 2017, 5:49pm

Try adding torch.cuda.synchronize after each cuda operation. That way the error won’t be deferred to later operations.
http://pytorch.org/docs/master/cuda.html#torch.cuda.synchronize

If you can share your script, I can take a look.
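
As a rough illustration of the synchronize suggestion: the wrapper below calls an operation and then waits for the GPU, so a failing kernel is reported at that call rather than at some later, unrelated one. `run_synced` is a hypothetical name used only for this sketch, and the example falls back to the CPU when no GPU is present.

```python
import torch


def run_synced(op, *args, **kwargs):
    """Call `op`, then wait for queued CUDA work so a failure surfaces here.

    Hypothetical debugging helper, not part of PyTorch.
    """
    out = op(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out


device = "cuda" if torch.cuda.is_available() else "cpu"
src = torch.arange(6.0, device=device).reshape(2, 3)
index = torch.tensor([[2], [0]], device=device)
picked = run_synced(torch.gather, src, 1, index)
```

Synchronizing after every operation slows training considerably, so this is meant for narrowing down the failing call, not for regular runs.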

smth · December 18, 2017, 5:51pm

An easy way to get the correct stack trace with CUDA is to run:

CUDA_LAUNCH_BLOCKING=1 python yourscript.py

This puts CUDA in synchronous mode (rather than its default asynchronous mode).
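
If setting the variable on the command line is inconvenient (for example, when the job is started from another Python program), the same effect can be sketched with a small launcher. The helper names `blocking_env` and `run_blocking` are made up for illustration. The variable has to be in the environment before CUDA is initialized, which is why this sketch starts a fresh child process.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script, *script_args):
    """Run `script` in a child Python process with CUDA_LAUNCH_BLOCKING=1."""
    cmd = [sys.executable, script] + list(script_args)
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example (not executed here): run_blocking("yourscript.py")
```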

mbarnes1 · December 19, 2017, 3:08pm

Ran a couple instances of training yesterday with CUDA_LAUNCH_BLOCKING=1, and was able to debug the issue based on the better stack trace. Long story short, it ended up being an issue with torch.gather very rarely receiving an invalid index. Thanks everyone for your help.
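
For a failure that shows up only rarely, it may be useful to log where the invalid entries sit in the index tensor before calling torch.gather. The thread does not show the actual fix, so the following is just an assumed sketch; `invalid_index_positions` is a hypothetical helper.

```python
import torch


def invalid_index_positions(index, size):
    """Return the positions in `index` whose value is outside [0, size).

    The result has one row per offending entry and one column per
    dimension of `index`. Hypothetical helper, not part of PyTorch.
    """
    bad = (index < 0) | (index >= size)
    return bad.nonzero()


index = torch.tensor([[0, 2], [7, 1], [-1, 3]])
positions = invalid_index_positions(index, 4)
print(positions.tolist())  # [[1, 0], [2, 0]]
```

An empty result means every index is valid for a dimension of the given size.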

Rafael_Valle · January 17, 2018, 5:42am

Did the trick for me!

ezyang · April 26, 2018, 10:52pm

@Rafael_Valle Erm, what exactly did you do?

Rafael_Valle · April 26, 2018, 11:27pm

called torch.cuda.synchronize()

isalirezag · February 25, 2019, 3:19pm

where did you put it?

Rafael_Valle · February 25, 2019, 6:38pm

Take a look at SimonW’s answer:
“Try adding torch.cuda.synchronize after each cuda operation.”

abdullahkhilji · June 30, 2019, 5:22pm

Yes, torch.gather is very efficient, but we need to be careful when using it. Thanks!