RuntimeError when using BertModel from Huggingface

I have run into this error when using BertModel from Huggingface transformers version 2.10.0, while it works with version 2.1.1 and on CPU:

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [145,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "run.py", line 58, in <module>
    cmd(args)
  File "/project/piqasso/tools/biaffine-parser/parser/cmds/train.py", line 82, in __call__
    self.train(train.loader)
  File "/project/piqasso/tools/biaffine-parser/parser/cmds/cmd.py", line 83, in train
    arc_scores, rel_scores = self.model(words, feats)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/piqasso/tools/biaffine-parser/parser/model.py", line 90, in forward
    feat_embed = self.feat_embed(*feats)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/piqasso/tools/biaffine-parser/parser/modules/bert.py", line 43, in forward
    bert = bert[bert_mask].split(bert_lens[mask].tolist())
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

Any hints on how to find the cause?

Thank you.

I believe the problem was due to a sentence of length 383, which was split into 672 wordpieces.

I traced the error to this call in
lib64/python3.6/site-packages/torch/nn/modules/sparse.py:

111  ->     def forward(self, input):
112             return F.embedding(
113                 input, self.weight, self.padding_idx, self.max_norm,
114                 self.norm_type, self.scale_grad_by_freq, self.sparse)

where input = arange(672).

(Pdb)
--Return--
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered

So it likely exceeds BERT's limit of 512 input positions and corrupts memory.
I wonder why this limit is not checked, which would avoid the memory corruption.
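For reference, here is a minimal sketch (the sizes are just illustrative, not the parser code) that reproduces this failure mode with a plain nn.Embedding on the GPU:

```python
import torch
import torch.nn as nn

# Illustrative repro: BERT's position embedding table has 512 rows,
# so any position index >= 512 is out of range.
emb = nn.Embedding(num_embeddings=512, embedding_dim=768).cuda()
positions = torch.arange(672, device="cuda")  # 672 wordpieces, but only 512 positions

out = emb(positions)      # launches the lookup kernel; indices 512..671 are invalid
torch.cuda.synchronize()  # the device-side assert typically surfaces here or later
```

Because CUDA kernels run asynchronously, the assert often only surfaces at the next operation that synchronizes, which is why the traceback points at the copy_if in bert.py rather than at the embedding lookup itself.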

Based on the error message, it seems that you are using the GPU:

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [145,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Which PyTorch version are you using? Could you update to the nightly binaries and rerun the code, please?

I also tried with torch 1.5.
But it looks like an issue with torch.nn.Embedding, which does not check whether the indices in its input are within range.
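For comparison, the same out-of-range lookup on the CPU fails immediately with an explicit error, whereas on the GPU it only trips the asynchronous device-side assert. A small sketch (sizes are illustrative):

```python
import torch
import torch.nn.functional as F

weight = torch.randn(512, 768)  # 512-row embedding table, like BERT's position embeddings
bad_idx = torch.tensor([600])   # index past the last row

# On CPU this raises an index-out-of-range error right away (the exact message
# depends on the PyTorch version); on CUDA the same lookup only triggers an
# asynchronous device-side assert, so the failure shows up later, e.g. in the
# copy_if call from the traceback above.
F.embedding(bad_idx, weight)
```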

Did the nightly binaries yield another error message? An issue was fixed recently that prevented the device-side debugging asserts from raising a proper error message and let the code run into an illegal memory access instead.

I haven’t tried that. I added a check in my code before invoking forward(), which is worth doing anyhow.
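For reference, this is roughly the kind of check I mean (the names are illustrative, not the actual biaffine-parser code):

```python
from transformers import BertModel

bert_model = BertModel.from_pretrained("bert-base-cased")
max_len = bert_model.config.max_position_embeddings  # 512 for bert-base

def check_length(subwords):
    """Refuse sentences whose wordpiece expansion exceeds the position limit."""
    if len(subwords) > max_len:
        raise ValueError(
            f"sentence expands to {len(subwords)} wordpieces, "
            f"exceeding the model limit of {max_len}"
        )
```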
Thank you.