Device-side assert only when using a GPU

Hi, I’m getting a device-side assert related to out-of-bounds indexing, but only when running on a GPU. If I set CUDA_VISIBLE_DEVICES="" the code works as expected.

PyTorch version: 0.2.0.post3
CUDA version: 8.0

Trace:

/pytorch/torch/lib/THC/THCTensorIndex.cu:378: long calculateOffset(IndexType, LinearIndexCalcData<IndexType, Dims>) [with IndexType = unsigned int, Dims = 2U]: block: [9,0,0], thread: [32,0,0] Assertion `indexAtDim < data.baseSizes[dim]` failed.
/pytorch/torch/lib/THC/THCTensorIndex.cu:378: long calculateOffset(IndexType, LinearIndexCalcData<IndexType, Dims>) [with IndexType = unsigned int, Dims = 2U]: block: [9,0,0], thread: [33,0,0] Assertion `indexAtDim < data.baseSizes[dim]` failed.
/pytorch/torch/lib/THC/THCTensorIndex.cu:378: long calculateOffset(IndexType, LinearIndexCalcData<IndexType, Dims>) [with IndexType = unsigned int, Dims = 2U]: block: [9,0,0], thread: [34,0,0] Assertion `indexAtDim < data.baseSizes[dim]` failed.

THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/…/THCReduceAll.cuh line=334 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "paragraphvec/train.py", line 194, in <module>
    fire.Fire()
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "paragraphvec/train.py", line 94, in start
    save_all)
  File "paragraphvec/train.py", line 142, in _run
    x = cost_func.forward(x)
  File "/home/nejc/dev/paragraph-vectors/paragraphvec/loss.py", line 26, in forward
    + torch.sum(self._log_sigmoid(-scores[:, 1:]), dim=1) / k
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/torch/autograd/variable.py", line 476, in sum
    return Sum.apply(self, dim, keepdim)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/torch/autograd/_functions/reduce.py", line 16, in forward
    return input.new((input.sum(),))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/…/THCReduceAll.cuh:334
terminate called after throwing an instance of 'std::runtime_error'
  what(): cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCStorage.c:182
Aborted (core dumped)

I’m not sure how to approach solving this as I can’t reproduce the bug on a CPU. Thanks for suggestions.

Could you run your code again with
CUDA_LAUNCH_BLOCKING=1 python your_script.py?
CUDA kernels are launched asynchronously, so the error may actually have occurred on a different line than the one shown in the traceback.
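If prefixing the command is inconvenient, the same variable can also be set from inside the script. This is only a minimal sketch, and it assumes the assignment runs before anything initializes CUDA, which is why it sits above the torch import:

```python
import os

# Assumption: this runs before the first CUDA call in the process.
# Putting it above the torch import is the safe choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and build the model only after this point
```

With synchronous launches, the Python traceback should point at the operation that actually triggered the assert, at the cost of slower execution.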

Here it is:

THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCTensorIndex.cu line=586 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "paragraphvec/train.py", line 194, in <module>
    fire.Fire()
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "paragraphvec/train.py", line 94, in start
    save_all)
  File "paragraphvec/train.py", line 141, in _run
    batch.target_noise_ids)
  File "/home/nejc/dev/paragraph-vectors/paragraphvec/models.py", line 55, in forward
    self._D[doc_ids, :], torch.sum(self._W[context_ids, :], dim=1))
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/torch/autograd/variable.py", line 76, in __getitem__
    return Index.apply(self, key)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/torch/autograd/_functions/tensor.py", line 16, in forward
    result = i.index(ctx.index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCTensorIndex.cu:586
terminate called after throwing an instance of 'std::runtime_error'
  what(): cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCStorage.c:182
Aborted (core dumped)

It is indeed a different line:
x = torch.add(self._D[doc_ids, :], torch.sum(self._W[context_ids, :], dim=1))
I’m still not sure why it only occurs on a GPU, but at least I have a starting point for debugging. Thanks.
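Since the failing assertion is indexAtDim < data.baseSizes[dim], one way to narrow it down might be to check the index values on the host before the indexing line runs. The sketch below is not from the project: find_out_of_range and check_index_bounds are hypothetical helper names, the inputs are assumed to be flat sequences of Python ints (for example the index tensor copied to the CPU and converted to a list), and negative values are treated as invalid here.

```python
def find_out_of_range(indices, size):
    """Return (position, value) pairs for indices outside [0, size)."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < size]


def check_index_bounds(indices, size, name="indices"):
    """Raise IndexError listing the first few offending indices, if any."""
    bad = find_out_of_range(indices, size)
    if bad:
        raise IndexError(
            "{}: {} value(s) outside [0, {}); first few (position, value): {}".format(
                name, len(bad), size, bad[:5]
            )
        )


# Plain lists stand in for the flattened index tensors here.
check_index_bounds([0, 2, 1], size=3, name="doc_ids")  # valid, passes silently
print(find_out_of_range([0, 3, -1, 2], size=3))
```

Called just before the failing line with the flattened doc_ids and the first dimension of self._D (and likewise context_ids with self._W), this would raise an ordinary Python exception naming the bad values instead of a device-side assert.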

How can I debug this when ‘CUDA_LAUNCH_BLOCKING=1 python your_script.py’ does not work? Running it that way gives me ‘cuda runtime error(30): no cuda device available’. Is there some other way?