How to fix a “CUDA error: device-side assert triggered” error?

I used Hugging Face Transformers to fine-tune a binary classification model. When I run inference on large amounts of data, it rarely triggers a “CUDA error: device-side assert triggered” error. Strangely, when I debug the single failing batch on its own, it passes (both on GPU and CPU), and I don’t know why.

The error is first triggered on

probs = probs.cpu().numpy()

and after that it is triggered on

input_ids = torch.tensor(batch['input_ids'], dtype=torch.long).to(device)

2021-11-18 20:18:41,251 - non_news_model.py[line:342] - ERROR: Traceback (most recent call last):
  File "/data/project/NonNewsInference/InferenceServices/non_news_model.py", line 307, in DoInference
    probs = probs.cpu().numpy()
RuntimeError: CUDA error: device-side assert triggered
, DoInference is dead!
2021-11-18 20:18:41,263 - non_news_model.py[line:345] - ERROR: Bad batch recorded!
2021-11-18 20:18:41,292 - non_news_model.py[line:342] - ERROR: Traceback (most recent call last):
  File "/data/project/NonNewsInference/InferenceServices/non_news_model.py", line 298, in DoInference
    dtype=torch.long).to(device)
RuntimeError: CUDA error: device-side assert triggered
, DoInference is dead!

Could anyone tell me how to solve this problem? Thanks!!

CUDA operations are executed asynchronously, so the stack trace might point to the wrong line of code. Rerun your script via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing operation in the reported stack trace. Often these asserts are triggered by an invalid indexing operation.
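
These asserts are commonly raised by an out-of-range index, e.g. a token id that is larger than the embedding table. The following is a minimal, hypothetical reproduction (not taken from the original post); running it via CUDA_LAUNCH_BLOCKING=1 python repro.py makes the failing kernel show up at the right line of the stack trace:

import torch
import torch.nn as nn

device = "cuda"
emb = nn.Embedding(num_embeddings=100, embedding_dim=8).to(device)

# Token id 100 is out of range for a table of size 100 and fires the
# device-side assert inside the index kernel.
bad_ids = torch.tensor([[1, 5, 100]], device=device)
out = emb(bad_ids)
# Without CUDA_LAUNCH_BLOCKING=1 the error may only surface at a later
# synchronizing call such as this one:
print(out.sum().item())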


Thank you @ptrblck. This error is only triggered occasionally during model inference. If I hit this exception during a large-scale inference job, how can I accurately find the bad batch of data? As you said, CUDA operations are asynchronous, so if I catch the exception and log the current batch, will that actually locate the bad one?

like this:

import pickle
import traceback

import torch

# model, device, inference_queue, softmax and logger are created elsewhere.
try:
    batch = inference_queue.get(block=True)
    with torch.no_grad():
        # Build the input tensors on the CPU and move them to the GPU.
        input_ids = torch.tensor(batch['input_ids'],
                                 dtype=torch.long).to(device)
        attention_mask = torch.tensor(batch['attention_mask'],
                                      dtype=torch.long).to(device)
        inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask
        }
        logits = model(**inputs)[0]
        probs = softmax(logits)
    # First synchronizing call; this is where the async assert surfaces.
    probs = probs.cpu().numpy()
except Exception as e:
    logger.error(f'{traceback.format_exc()}, DoInference is dead!')
    # Dump the batch that was being processed when the error was raised.
    with open('./BadBatch.pkl', 'wb') as f:
        pickle.dump(batch, f)
    logger.error('Bad batch recorded!')

You could run the script with the aforementioned env variable, which would point to the operation raising the error.
Your approach could work, but note that once you run into an assert the CUDA context might be corrupted, and I don’t know if you would be able to store any additional data.
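
An alternative to dumping the batch after the assert has already fired is to validate the token ids on the CPU before moving them to the GPU. This is only a sketch that reuses the names from the snippet above (model, batch, logger, device, pickle) and assumes a Hugging Face-style config exposing vocab_size:

import torch

vocab_size = model.config.vocab_size  # assumption: Hugging Face-style config

input_ids = torch.tensor(batch['input_ids'], dtype=torch.long)
if input_ids.min().item() < 0 or input_ids.max().item() >= vocab_size:
    # This batch would trigger the device-side assert; record it and skip it.
    logger.error(f'Bad batch: token ids outside [0, {vocab_size})')
    with open('./BadBatch.pkl', 'wb') as f:
        pickle.dump(batch, f)
else:
    input_ids = input_ids.to(device)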

Thank you! I solved this problem. It was caused by my tokenizer producing token ids that did not match the model’s vocabulary size. :grinning:
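
For anyone hitting the same mismatch, this is a rough sketch of the check (the model and tokenizer names are placeholders, not the ones I actually used): compare the tokenizer size against the model’s input embedding table and resize it if tokens were added.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

print(len(tokenizer), model.get_input_embeddings().num_embeddings)

# If extra tokens were added to the tokenizer, grow the embedding table so
# every produced token id has a row in the model's vocabulary:
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))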


Could you describe what your solution to that error was? I am also facing the “CUDA error: device-side assert triggered” error when running inference with YOLOv8’s tracking. Thank you.

I have the same error while trying to resume YOLOv8:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Rerun your script via:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and check the failing operation in the reported stack trace as already mentioned in this topic.
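
If setting the variable on the command line is awkward (e.g. in a notebook), a small sketch of the in-script alternative is below; the value has to be set before the first CUDA call initializes the context, ideally before importing torch:

import os
# Must run before CUDA is initialized, ideally before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
print(torch.cuda.is_available())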


Thanks, your comment made me re-check that the tokenizer and the model come from the same repo.

Run your code on the CPU device and you will find the actual error.
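
As a toy illustration (not the poster’s code), the same out-of-range lookup that only produces an opaque asynchronous assert on the GPU raises a readable Python exception on the CPU:

import torch
import torch.nn as nn

emb = nn.Embedding(100, 8)             # CPU module
bad_ids = torch.tensor([[1, 5, 100]])  # 100 is out of range

try:
    emb(bad_ids)
except IndexError as e:
    print(e)  # e.g. "index out of range in self"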
