CUDA RuntimeError: CUDA error: device-side assert triggered (occurring only with some data)


(Mkwissam) #1

Hello

I’m trying to run prediction with a pre-trained model. The code runs normally on most of the data, but with some data rows I get the following error:

, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [216,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [216,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
File "generate_paraphrases.py", line 198, in
encode_data(out_file=args.out_file)
File "generate_paraphrases.py", line 81, in encode_data
torch_sent = Variable(torch.from_numpy(np.array(seg_sent, dtype='int32')).long().cuda())
RuntimeError: CUDA error: device-side assert triggered

The error sometimes occurs when executing:
in_embs = self.trans_embs(trans)

(The CUDA traceback doesn’t point to the right line.)

Note that my code works fine with the same data on Windows; I get these errors only on the Linux server.

I tried to catch the exception and continue execution, but I couldn’t use CUDA after the error.

Any ideas, please?


(Alban D) #2

You can see in the exception “Assertion srcIndex < srcSelectDimSize failed.” This means that you tried to index a Tensor with an index that is greater than or equal to its size along the selected dimension.
You should check all your indices in more detail and make sure they are correct.
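One way to carry out such a check is to scan the index rows on the CPU before they are sent to the GPU. The following is only a minimal, framework-free sketch: `find_out_of_range` and the sample rows are hypothetical, and `num_embeddings` stands for the number of rows in the embedding table (for an `nn.Embedding` layer, its `num_embeddings` attribute).

```python
def find_out_of_range(rows, num_embeddings):
    """Return (row, position, value) for every index outside [0, num_embeddings)."""
    bad = []
    for r, row in enumerate(rows):
        for p, idx in enumerate(row):
            if not 0 <= idx < num_embeddings:
                bad.append((r, p, idx))
    return bad


# Hypothetical data: a 5-row embedding table accepts indices 0..4 only.
rows = [[0, 3, 4], [2, 5, 1], [-1, 0, 2]]
for r, p, idx in find_out_of_range(rows, num_embeddings=5):
    print(f"row {r}, position {p}: index {idx} is out of range")
```

Rows reported by a check like this can be fixed or skipped before the tensor is created, which avoids triggering the device-side assert in the first place.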
Note that to get a stack trace that points exactly to the operation that causes the issue, you can either run on the CPU or set CUDA_LAUNCH_BLOCKING=1 before launching your script on the GPU.
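As a sketch of the second option, the variable can also be set from inside the script, assuming this happens before CUDA is initialized. Putting it at the very top, above the `torch` import, is the cautious choice. It is meant for debugging only, since it slows GPU execution.

```python
import os

# Set this before CUDA is initialized, so keep it above `import torch`.
# With it set, kernel launches run synchronously and the Python traceback
# points at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # the rest of the script would follow here
```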


(Mkwissam) #3

Hello
Thank you. I fixed my problem following your advice: I checked all the indices in my code until I found the out-of-bounds index.
But I am wondering why this problem occurs only on the server. The same code ran without any errors on my Windows machine!


(Alban D) #4

What was the problem? Maybe the dataset was not saved the same way on both machines? Or empty/invalid ops might give something different?
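One way to test the first hypothesis is to hash the data file on each machine and compare the digests. A minimal sketch follows; `file_sha256` is a hypothetical helper, and the path has to be supplied by the reader, since the thread does not name the dataset file.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in binary chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: call file_sha256(<path to the data file>) on both machines
# and compare the two hex strings.
```

Matching digests would rule out a difference in the stored data. Differing digests are not conclusive on their own, because for text files a change in line endings between Windows and Linux is enough to alter the hash.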