Has anyone found a solution, by any chance? I get the same error when training the Hugging Face RoBERTa and BERT models from scratch (transformers/examples/language-modeling at master · huggingface/transformers · GitHub). I get a large number of these errors:
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [372,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Then the stack trace:
Traceback (most recent call last):
File "/data/medioli/transformers/examples/language-modeling/run_mlm.py", line 491, in <module>
main()
File "/data/medioli/transformers/examples/language-modeling/run_mlm.py", line 457, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/data/medioli/env/lib/python3.6/site-packages/transformers/trainer.py", line 1053, in train
tr_loss += self.training_step(model, inputs)
File "/data/medioli/env/lib/python3.6/site-packages/transformers/trainer.py", line 1443, in training_step
loss = self.compute_loss(model, inputs)
File "/data/medioli/env/lib/python3.6/site-packages/transformers/trainer.py", line 1475, in compute_loss
outputs = model(**inputs)
File "/data/medioli/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/medioli/env/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/data/medioli/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/medioli/env/lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 1057, in forward
return_dict=return_dict,
File "/data/medioli/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/medioli/env/lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 810, in forward
past_key_values_length=past_key_values_length,
File "/data/medioli/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/medioli/env/lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 123, in forward
embeddings += position_embeddings
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa4517ed1e2 in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fa451a3bf92 in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fa4517db9cd in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x25a (0x7fa427f8489a in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::Reducer::~Reducer() + 0x28a (0x7fa427f79b1a in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fa427f593c2 in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fa4277577a6 in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xa6b08b (0x7fa427f5a08b in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x273c00 (0x7fa427762c00 in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x274e4e (0x7fa427763e4e in /data/medioli/env/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #22: main + 0x16e (0x400a3e in /data/medioli/env/bin/python3)
frame #23: __libc_start_main + 0xf5 (0x7fa48f4903d5 in /lib64/libc.so.6)
frame #24: /data/medioli/env/bin/python3() [0x400b02]
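For anyone trying to narrow this down: the `srcIndex < srcSelectDimSize` assertion comes from an index-select kernel, which usually means some index passed to an embedding lookup is not smaller than the number of rows in that table. Below is a minimal sketch of a CPU-side check, not the library's own code. The helper names `find_out_of_range` and `roberta_position_ids` are hypothetical, the sizes in the example are made up, and the position numbering (non-padding tokens counted from `padding_idx + 1`) is my understanding of how RoBERTa builds position ids, so treat it as an assumption.

```python
def find_out_of_range(ids, table_size):
    """Return (position, value) pairs that an embedding table with
    `table_size` rows could not look up."""
    return [(i, v) for i, v in enumerate(ids) if v < 0 or v >= table_size]


def roberta_position_ids(input_ids, padding_idx=1):
    """Assumed RoBERTa-style numbering: padding tokens keep padding_idx,
    other tokens count up starting at padding_idx + 1."""
    position_ids, count = [], 0
    for tok in input_ids:
        if tok == padding_idx:
            position_ids.append(padding_idx)
        else:
            count += 1
            position_ids.append(count + padding_idx)
    return position_ids


# Hypothetical sizes, standing in for config.vocab_size and
# config.max_position_embeddings of the model being trained.
vocab_size = 10
max_position_embeddings = 6

input_ids = [0, 4, 11, 5, 7, 2]  # made-up token ids for one sequence

bad_tokens = find_out_of_range(input_ids, vocab_size)
bad_positions = find_out_of_range(
    roberta_position_ids(input_ids), max_position_embeddings
)

print("token ids outside the vocabulary:", bad_tokens)
print("position ids outside the position table:", bad_positions)
```

If a check like this flags token ids, the tokenizer vocabulary is probably larger than the model's `vocab_size`; if it flags position ids, the sequences are probably longer than the position table allows. Running with `CUDA_LAUNCH_BLOCKING=1`, or on CPU, should also make the Python traceback point at the lookup that actually failed.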