How to fix “CUDA error: device-side assert triggered” error?

I use Hugging Face Transformers to fine-tune a binary classification model. When I run an inference job on a large dataset, in rare cases it triggers a “CUDA error: device-side assert triggered” error. The strange thing is that when I debug the single failing batch in isolation, it passes (both on GPU and CPU), and I don’t know why.

The error is first triggered on

probs = probs.cpu().numpy()

and after that it is triggered on

input_ids = torch.tensor(batch['input_ids'], dtype=torch.long).to(device)

2021-11-18 20:18:41,251 - non_news_model.py[line:342] - ERROR: Traceback (most recent call last):
  File "/data/project/NonNewsInference/InferenceServices/non_news_model.py", line 307, in DoInference
    probs = probs.cpu().numpy()
RuntimeError: CUDA error: device-side assert triggered
, DoInference is dead!
2021-11-18 20:18:41,263 - non_news_model.py[line:345] - ERROR: Bad batch recorded!
2021-11-18 20:18:41,292 - non_news_model.py[line:342] - ERROR: Traceback (most recent call last):
  File "/data/project/NonNewsInference/InferenceServices/non_news_model.py", line 298, in DoInference
    dtype=torch.long).to(device)
RuntimeError: CUDA error: device-side assert triggered
, DoInference is dead!

Could anyone tell me how to solve this problem? Thanks!!

CUDA operations are executed asynchronously, so the stack trace might point to the wrong line of code. Rerun your script via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing operation in the reported stack trace. Often these asserts are triggered by an invalid indexing operation.
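
If it is inconvenient to change the launch command, you could also set the variable at the top of the script before the first CUDA call (a minimal sketch, not the only way to do it):

import os

# Force synchronous kernel launches so the stack trace points at the
# operation that actually failed. Set this before any CUDA work is done
# (safest: before importing torch).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch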


Thank you @ptrblck. This error is only triggered occasionally during model inference. If I encounter this exception during a large-scale inference task, how can I accurately find the bad batch of data? As you said, CUDA operations are asynchronous, so if I catch the exception and log the bad batch, can I still locate the wrong batch?

like this:

from torch.nn.functional import softmax  # turn logits into probabilities

try:
    batch = inference_queue.get(block=True)
    with torch.no_grad():
        input_ids = torch.tensor(batch['input_ids'],
                                 dtype=torch.long).to(device)
        attention_mask = torch.tensor(batch['attention_mask'],
                                      dtype=torch.long).to(device)
        inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask
        }
        logits = model(**inputs)[0]
        probs = softmax(logits, dim=-1)
    probs = probs.cpu().numpy()
except Exception:
    # Log the traceback and dump the offending batch for offline debugging.
    logger.error(f'{traceback.format_exc()}, DoInference is dead!')
    with open('./BadBatch.pkl', 'wb') as f:
        pickle.dump(batch, f)
    logger.error('Bad batch recorded!')

You could run the script with the aforementioned env variable, which would point to the operation raising the error.
Your approach could work, but note that once you run into an assert, the CUDA context might be corrupted, and I don’t know whether you would still be able to store any additional data.
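
As an alternative that does not rely on the (possibly corrupted) CUDA context, you could validate each batch on the CPU before moving it to the GPU. A rough sketch, reusing the names from your snippet and assuming a Hugging Face model so that model.config.vocab_size is available:

# Check token ids on the CPU before any GPU transfer; an out-of-range id
# is a common cause of the device-side assert in the embedding lookup.
input_ids = torch.tensor(batch['input_ids'], dtype=torch.long)
if input_ids.min() < 0 or input_ids.max() >= model.config.vocab_size:
    logger.error('Out-of-range token id found, dumping batch')
    with open('./BadBatch.pkl', 'wb') as f:
        pickle.dump(batch, f)
else:
    input_ids = input_ids.to(device)
    # ... continue with attention_mask, forward pass, etc.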

Thank you! I solved the problem. It was caused by my tokenizer’s output not matching the model’s vocabulary size. :grinning:
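
For anyone who hits the same issue: a quick consistency check like this (assuming a Hugging Face tokenizer and model; the names are illustrative) can catch the mismatch before running inference:

# Compare the range of ids the tokenizer can produce with the size of
# the model's input embedding matrix.
embedding_size = model.get_input_embeddings().num_embeddings
if len(tokenizer) > embedding_size:
    # If extra tokens were added to the tokenizer, grow the embedding:
    model.resize_token_embeddings(len(tokenizer))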