How to fix “CUDA error: device-side assert triggered” error?

I use Hugging Face Transformers to fine-tune a binary classification model. When I run an inference job on a large dataset, in rare cases it triggers a “CUDA error: device-side assert triggered” error. The strange thing is that when I debug the single failing batch in isolation, it passes (both on GPU and CPU), and I don’t know why.

The error is first triggered on

probs = probs.cpu().numpy()

and after that it is triggered on

input_ids = torch.tensor(batch['input_ids'], dtype=torch.long).to(device)

2021-11-18 20:18:41,251 - non_news_model.py[line:342] - ERROR: Traceback (most recent call last):
  File "/data/project/NonNewsInference/InferenceServices/non_news_model.py", line 307, in DoInference
    probs = probs.cpu().numpy()
RuntimeError: CUDA error: device-side assert triggered
, DoInference is dead!
2021-11-18 20:18:41,263 - non_news_model.py[line:345] - ERROR: Bad batch recorded!
2021-11-18 20:18:41,292 - non_news_model.py[line:342] - ERROR: Traceback (most recent call last):
  File "/data/project/NonNewsInference/InferenceServices/non_news_model.py", line 298, in DoInference
    dtype=torch.long).to(device)
RuntimeError: CUDA error: device-side assert triggered
, DoInference is dead!

Could anyone tell me how to solve this problem? Thanks!!

CUDA operations are executed asynchronously, so the stack trace might point to the wrong line of code. Rerun your script via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing operation in the reported stack trace. Often these asserts are triggered by an invalid indexing operation.
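
If it is inconvenient to change the launch command, you could also set the variable at the top of the script before the first CUDA call (a minimal sketch, not the only way to do it):

import os

# Force synchronous kernel launches so the stack trace points at the
# operation that actually failed. Set this before any CUDA work is done
# (safest: before importing torch).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch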


Thank you @ptrblck. This error is only triggered occasionally during model inference. If I encounter this exception during a large-scale inference task, how can I accurately find the bad batch of data? As you said, CUDA operations are asynchronous, so if I catch the exception and log the bad batch, can I still locate the wrong batch?

like this:

from torch.nn.functional import softmax  # turn logits into probabilities

try:
    batch = inference_queue.get(block=True)
    with torch.no_grad():
        input_ids = torch.tensor(batch['input_ids'],
                                 dtype=torch.long).to(device)
        attention_mask = torch.tensor(batch['attention_mask'],
                                      dtype=torch.long).to(device)
        inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask
        }
        logits = model(**inputs)[0]
        probs = softmax(logits, dim=-1)
    probs = probs.cpu().numpy()
except Exception:
    # Log the traceback and dump the offending batch for offline debugging.
    logger.error(f'{traceback.format_exc()}, DoInference is dead!')
    with open('./BadBatch.pkl', 'wb') as f:
        pickle.dump(batch, f)
    logger.error('Bad batch recorded!')

You could run the script with the aforementioned env variable, which would point to the operation raising the error.
Your approach could work, but note that once you run into an assert, the CUDA context might be corrupted, and I don’t know whether you would still be able to store any additional data.
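
As an alternative that does not rely on the (possibly corrupted) CUDA context, you could validate each batch on the CPU before moving it to the GPU. A rough sketch, reusing the names from your snippet and assuming a Hugging Face model so that model.config.vocab_size is available:

# Check token ids on the CPU before any GPU transfer; an out-of-range id
# is a common cause of the device-side assert in the embedding lookup.
input_ids = torch.tensor(batch['input_ids'], dtype=torch.long)
if input_ids.min() < 0 or input_ids.max() >= model.config.vocab_size:
    logger.error('Out-of-range token id found, dumping batch')
    with open('./BadBatch.pkl', 'wb') as f:
        pickle.dump(batch, f)
else:
    input_ids = input_ids.to(device)
    # ... continue with attention_mask, forward pass, etc.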

Thank you! I solved the problem. It was caused by my tokenizer’s output not matching the model’s vocabulary size. :grinning:
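
For anyone who hits the same issue: a quick consistency check like this (assuming a Hugging Face tokenizer and model; the names are illustrative) can catch the mismatch before running inference:

# Compare the range of ids the tokenizer can produce with the size of
# the model's input embedding matrix.
embedding_size = model.get_input_embeddings().num_embeddings
if len(tokenizer) > embedding_size:
    # If extra tokens were added to the tokenizer, grow the embedding:
    model.resize_token_embeddings(len(tokenizer))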