CUDA RuntimeError: CUDA error: device-side assert triggered (occurring only with some data)

Hello

I’m trying to run prediction with a pre-trained model. The code runs normally on most data rows, but on some rows I get the following error:

, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [216,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [216,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
  File "generate_paraphrases.py", line 198, in <module>
    encode_data(out_file=args.out_file)
  File "generate_paraphrases.py", line 81, in encode_data
    torch_sent = Variable(torch.from_numpy(np.array(seg_sent, dtype='int32')).long().cuda())
RuntimeError: CUDA error: device-side assert triggered

The error is also sometimes raised when executing
in_embs = self.trans_embs(trans)

so the CUDA traceback does not seem to point to the right line.

Note that the same code works fine with the same data on Windows; I only get these errors on the Linux server.

I tried to catch the exception and continue execution, but I couldn’t use CUDA after the error.

Any ideas, please?

You can see in the exception "Assertion srcIndex < srcSelectDimSize failed." This means that you tried to index a Tensor with an index that is out of range: every index must be smaller than the size of the dimension being indexed.
You should check all your indices in more detail and make sure they are correct.
Note that to get a stack trace that points exactly to the operation that causes the issue, you can either run on the CPU or set CUDA_LAUNCH_BLOCKING=1 before launching your script on the GPU.
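
As a rough illustration of the index check described above, here is a minimal sketch in plain Python. The helper name find_bad_indices and the sample values are hypothetical, not taken from the original script. It also sets CUDA_LAUNCH_BLOCKING from inside Python, which only has an effect if it happens before CUDA is initialised.

```python
import os

# Set this before the first CUDA call so that kernels run synchronously
# and the Python traceback points at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_bad_indices(indices, num_embeddings):
    """Return (position, value) pairs that a lookup table with
    num_embeddings rows cannot serve: negative values or values
    greater than or equal to num_embeddings."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if idx < 0 or idx >= num_embeddings
    ]


# Hypothetical example: a table with 10 rows accepts indices 0..9 only.
seg_sent = [3, 7, 10, 2, -1]
bad = find_bad_indices(seg_sent, num_embeddings=10)
print(bad)  # [(2, 10), (4, -1)]
```

In the script from the question, a check like this could run on seg_sent before the .cuda() call, with the table size read from self.trans_embs.num_embeddings (assuming trans_embs is an nn.Embedding, which the thread does not confirm). Rows that fail the check can then be logged or skipped on the CPU side, before anything reaches the GPU.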

Hello
Thank you. I fixed my problem by following your advice: I checked all the indices in my code until I found the out-of-bounds index.
But I am wondering why this problem occurs only on the server. The same code ran without any errors on my Windows machine!

What was the problem? Maybe the dataset was not saved the same way on both machines? Or empty/invalid ops might give something different?
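
One way to test the "not saved the same way" hypothesis is to compare checksums of the data or vocabulary files on both machines. The sketch below uses only the standard library. The helper name file_sha256 is hypothetical, and the demo hashes a throwaway temporary file rather than any file from the thread.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a temporary file; on the real machines, pass the path of the
# dataset or vocabulary file instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
demo_path = tmp.name
demo_digest = file_sha256(demo_path)
os.remove(demo_path)
print(demo_digest)
```

If the digests differ between the Windows machine and the Linux server, the files are not byte-identical. For text files, a difference in line endings alone is enough to change the digest, so a mismatch is a hint to look closer rather than proof of a corrupted dataset.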