Hi, I’m getting a device-side assert related to out-of-bounds indexing, but only when running on a GPU. If I set CUDA_VISIBLE_DEVICES="" the code works as expected.
PyTorch version: 0.2.0.post3
CUDA version: 8.0
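For reference, the CPU-only run described above can be set up roughly as follows. This is a minimal sketch: `debug_env` is a hypothetical helper (not part of the project), and the `CUDA_LAUNCH_BLOCKING=1` setting is an extra I am assuming would help on the GPU side, since it makes CUDA kernel launches synchronous, so the Python traceback should point at the operation that failed rather than a later one.

```python
import os


def debug_env(use_gpu, base_env=None):
    """Return a copy of the environment for a debugging run (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    if use_gpu:
        # Synchronous kernel launches: the error is reported at the
        # call that triggered it instead of at a later CUDA call.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    else:
        # An empty value hides every GPU from the process,
        # which forces the CPU code path.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env


if __name__ == "__main__":
    cpu_env = debug_env(use_gpu=False)
    print(repr(cpu_env["CUDA_VISIBLE_DEVICES"]))
```

The returned dictionary could be passed as `env=` to `subprocess.run` when launching `paragraphvec/train.py`; the script's command-line arguments are left out here.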
Trace:
/pytorch/torch/lib/THC/THCTensorIndex.cu:378: long calculateOffset(IndexType, LinearIndexCalcData<IndexType, Dims>) [with IndexType = unsigned int, Dims = 2U]: block: [9,0,0], thread: [32,0,0] Assertion `indexAtDim < data.baseSizes[dim]` failed.
/pytorch/torch/lib/THC/THCTensorIndex.cu:378: long calculateOffset(IndexType, LinearIndexCalcData<IndexType, Dims>) [with IndexType = unsigned int, Dims = 2U]: block: [9,0,0], thread: [33,0,0] Assertion `indexAtDim < data.baseSizes[dim]` failed.
/pytorch/torch/lib/THC/THCTensorIndex.cu:378: long calculateOffset(IndexType, LinearIndexCalcData<IndexType, Dims>) [with IndexType = unsigned int, Dims = 2U]: block: [9,0,0], thread: [34,0,0] Assertion `indexAtDim < data.baseSizes[dim]` failed.
…
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/…/THCReduceAll.cuh line=334 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "paragraphvec/train.py", line 194, in <module>
    fire.Fire()
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "paragraphvec/train.py", line 94, in start
    save_all)
  File "paragraphvec/train.py", line 142, in _run
    x = cost_func.forward(x)
  File "/home/nejc/dev/paragraph-vectors/paragraphvec/loss.py", line 26, in forward
    + torch.sum(self._log_sigmoid(-scores[:, 1:]), dim=1) / k
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/torch/autograd/variable.py", line 476, in sum
    return Sum.apply(self, dim, keepdim)
  File "/home/nejc/dev/paragraph-vectors/env/lib/python3.5/site-packages/torch/autograd/_functions/reduce.py", line 16, in forward
    return input.new((input.sum(),))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/…/THCReduceAll.cuh:334
terminate called after throwing an instance of 'std::runtime_error'
  what(): cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCStorage.c:182
Aborted (core dumped)
I’m not sure how to approach this, since I can’t reproduce the bug on the CPU. Thanks for any suggestions.