Only cased models give "CUDA error: device-side assert triggered" in a QA span selection task, whereas uncased models work fine

Traceback (most recent call last):
  File "run_techqa3.py", line 629, in <module>
    main()
  File "run_techqa3.py", line 623, in main
    model = train(args, train_dataset, model, optimizer, tokenizer, model_evaluator)
  File "run_techqa3.py", line 222, in train
    outputs = model(**inputs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/Chunk_Extraction/TechQA-Base/techqa-master/model_techqa3.py", line 88, in forward
    outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/transformers/modeling_bert.py", line 790, in forward
    encoder_attention_mask=encoder_extended_attention_mask,
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/transformers/modeling_bert.py", line 407, in forward
    hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/transformers/modeling_bert.py", line 368, in forward
    self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/transformers/modeling_bert.py", line 314, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/transformers/modeling_bert.py", line 251, in forward
    context_layer = torch.matmul(attention_probs, value_layer)
RuntimeError: CUDA error: device-side assert triggered

With CUDA_LAUNCH_BLOCKING=1

Traceback (most recent call last):
  File "run_techqa3.py", line 629, in <module>
    main()
  File "run_techqa3.py", line 592, in main
    model, optimizer = load_model(args, model_class, config)
  File "run_techqa3.py", line 69, in load_model
    model = BERTplusAoA(config, args)
  File "/dccstor/sahban21_2905/Chunk_Extraction/TechQA-Base/techqa-master/model_techqa3.py", line 23, in __init__
    self.wb = nn.Parameter(torch.tensor([0.5]).cuda())
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

Based on this error message:

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

it seems your PyTorch application isn’t able to use the GPU.
Were you able to use the device before? If so, did you change something in the system (updates etc.)?
If you’ve updated the driver (or any other CUDA component), did you restart the machine?
Are other programs potentially blocking the device?
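
If it helps narrow this down, here is a minimal sketch of how to check what PyTorch can actually see on the node (nothing specific to your cluster setup is assumed):

import torch

# Does PyTorch see any CUDA device at all right now?
print(torch.cuda.is_available())         # expected: True
print(torch.cuda.device_count())         # number of visible GPUs
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0)) # name of the first visible device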

If I run any uncased model, it doesn't show any error and works fine. I didn't make any changes to the environment. I am working on a GPU cluster, and sometimes the GPUs aren't available, but in this case any other model runs flawlessly.

I tried bert-large-wwm-cased and it gave the error "CUDA error: device-side assert triggered". Immediately after that I ran bert-large-wwm-uncased and it worked fine.


What determines whether the GPUs are available?

To isolate the assert, you would have to rerun the code with CUDA_LAUNCH_BLOCKING=1 and post the stack trace, which is hopefully different from the previous one.
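
The variable needs to be in the environment before CUDA is initialized, so either set it on the command line or at the very top of the script. A minimal sketch (the import order is the important part):

import os

# Set before the first CUDA call so the runtime picks it up.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported only after the variable is set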

I ran it on my computer with no GPU, and the model trains without any error.

With CUDA_LAUNCH_BLOCKING=1

Traceback (most recent call last):
  File "run_techqa2.py", line 629, in <module>
    main()
  File "run_techqa2.py", line 623, in main
    model = train(args, train_dataset, model, optimizer, tokenizer, model_evaluator)
  File "run_techqa2.py", line 222, in train
    outputs = model(**inputs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/Chunk_Extraction/TechQA-Base/techqa-master/model_techqa1.py", line 88, in forward
    outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/transformers/modeling_bert.py", line 783, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/transformers/modeling_bert.py", line 174, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/dccstor/sahban21_2905/miniconda3/envs/chunk_ext/lib/python3.7/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

It seems that the embedding layer is raising this error.
This is often the case if you pass invalid indices to this layer.
Could you add assert statements and check the inputs?
E.g.:

import torch
import torch.nn as nn

num_embeddings = 10
emb = nn.Embedding(num_embeddings=num_embeddings, embedding_dim=100).cuda()
input = torch.randint(0, num_embeddings, (5,)).cuda()

# All indices are in [0, num_embeddings), so the lookup succeeds.
assert (input >= 0).all() and (input < num_embeddings).all(), 'INVALID INPUT'
out = emb(input)

input[0] = num_embeddings  # out-of-range index: will trigger the assert
assert (input >= 0).all() and (input < num_embeddings).all(), 'INVALID INPUT'
out = emb(input)

It was also giving this warning/error when we used CUDA_LAUNCH_BLOCKING=1:

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [308,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [308,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [308,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

An indexing operation is failing, which could be caused by invalid indices passed to the embedding layer.
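
One possibility worth checking, given that only the cased checkpoints fail: the uncased BERT vocabulary (30522 tokens) is larger than the cased one (28996), so input_ids built with an uncased vocab can be out of range for a cased embedding matrix. Here is a minimal sketch of that check, assuming a transformers tokenizer and model; the checkpoint name and text are only illustrative:

from transformers import BertModel, BertTokenizer

# Illustrative checkpoint; the key point is that tokenizer and model must come
# from the same (cased or uncased) checkpoint.
name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(name)
model = BertModel.from_pretrained(name)

ids = tokenizer.encode('Why does only the cased model fail?')
vocab_size = model.embeddings.word_embeddings.num_embeddings

# Any id >= vocab_size would trigger the device-side assert in the embedding lookup.
assert max(ids) < vocab_size, f'token id {max(ids)} out of range for vocab size {vocab_size}'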

I had the same issue using the bert-base-cased model and resolved it after changing to bert-base-uncased. However, my application might work better with the cased model.