I use the Hugging Face Transformers library to fine-tune a binary classification model. When I run an inference job on a large dataset, it occasionally triggers a “CUDA error: device-side assert triggered” error, but when I debug the single failing batch, it strangely passes (both on GPU and CPU). I don’t know why.
CUDA operations are executed asynchronously, so the stack trace might point to the wrong line of code. Rerun your script via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the failing operation in the reported stack trace. Often these asserts are triggered by an invalid indexing operation.
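As a minimal illustration of that failure mode (the embedding size and token ids below are made up, not taken from your model), a single out-of-range index into an embedding is enough to trigger the assert:

import torch
import torch.nn as nn

# Made-up sizes: an embedding table with 100 rows on the GPU.
emb = nn.Embedding(num_embeddings=100, embedding_dim=16).cuda()

good_ids = torch.tensor([0, 5, 99], device="cuda")   # all valid indices
bad_ids = torch.tensor([0, 5, 100], device="cuda")   # 100 is out of range

print(emb(good_ids).shape)  # torch.Size([3, 16])

# Without CUDA_LAUNCH_BLOCKING=1 this assert may surface at a later, unrelated
# CUDA call; with it, the stack trace points at this exact line.
print(emb(bad_ids).shape)   # RuntimeError: CUDA error: device-side assert triggered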
Thank you @ptrblck. This error is only occasionally triggered during model inference. If I encounter this exception during a large-scale inference task, how can I accurately find the bad batch of data? As you said, CUDA operations are asynchronous; if I catch the exception and log the bad batch, can I still locate it?
like this:
import pickle
import traceback

import torch

try:
    batch = inference_queue.get(block=True)
    with torch.no_grad():
        input_ids = torch.tensor(batch['input_ids'],
                                 dtype=torch.long).to(device)
        attention_mask = torch.tensor(batch['attention_mask'],
                                      dtype=torch.long).to(device)
        inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        logits = model(**inputs)[0]
        probs = torch.softmax(logits, dim=-1)
        probs = probs.cpu().numpy()
except Exception:
    logger.error(f'{traceback.format_exc()}, DoInference is dead!')
    # Persist the offending batch so it can be replayed later for debugging.
    with open('./BadBatch.pkl', 'wb') as f:
        pickle.dump(batch, f)
    logger.error('Bad batch recorded!')
You could run the script with the aforementioned env variable, which would point to the operation raising the error.
Your approach could work, but note that once you run into an assert, the CUDA context might be corrupted, and I don’t know if you would be able to store any additional data.
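If storing the batch from inside the except block turns out not to work, a rough alternative sketch (the vocab_size lookup and batch layout here are assumptions based on your snippet) is to validate each batch on the CPU before it ever reaches the GPU:

import torch

def check_batch(batch, vocab_size):
    """Validate a batch on the CPU and return a problem description or None."""
    input_ids = torch.tensor(batch['input_ids'], dtype=torch.long)
    attention_mask = torch.tensor(batch['attention_mask'], dtype=torch.long)

    if input_ids.shape != attention_mask.shape:
        return f"shape mismatch: {input_ids.shape} vs {attention_mask.shape}"
    if input_ids.min() < 0 or input_ids.max() >= vocab_size:
        return (f"token id out of range [0, {vocab_size}): "
                f"min={input_ids.min().item()}, max={input_ids.max().item()}")
    return None

# In the inference loop, before anything is moved to the GPU:
# problem = check_batch(batch, model.config.vocab_size)
# if problem is not None:
#     logger.error(f'Skipping bad batch: {problem}')
#     continue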
Could you describe your solution to that error? I am also facing the “CUDA error: device-side assert triggered” error when running inference with YOLOv8’s tracking. Thank you.
I have the same error while trying to resume YOLOv8:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I encountered the same error.
Steps that helped me resolve it:
Write these two lines at the top of the first file being executed. They make the stack trace show the true/exact line of code that was causing the problem.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
In my case the error was in this line:
channel_select_filtered_positive = all_filtered.view(-1)[indices.long()].view(1, height, width)
Changing it to
channel_select_filtered_positive = all_filtered.view(-1)[indices.int()].view(1, height, width)
resolved the error.
I would recommend understanding the fix in detail, as it seems the conversion from long to int, and thus the corresponding reduction in numerical range, “fixed” the issue, while the clipping is really a side effect.
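As a quick way to check whether the indices were actually out of range (a sketch reusing the variable names from the post above):

# Sanity check before the indexing operation, reusing the names from the post above.
flat = all_filtered.view(-1)
idx = indices.long()

assert idx.min() >= 0 and idx.max() < flat.numel(), (
    f"indices out of range: min={idx.min().item()}, "
    f"max={idx.max().item()}, numel={flat.numel()}"
)

# Casting int64 indices to int32 truncates large values instead of validating
# them, so an out-of-range index can silently land back inside the valid range.
channel_select_filtered_positive = flat[idx].view(1, height, width)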
I’m facing the same issue in my Transformer block’s forward pass in Llama3.
Using the CUDA debugging env variable, the failing line was detected, and in fact it was:
def forward(self, tokens: torch.Tensor):
    """Perform a forward pass through the Transformer model.

    Args:
        tokens (torch.Tensor): Input token indices.

    Returns:
        torch.Tensor: Output logits after applying the Transformer model.
    """
    # ERROR RAISES HERE
    # passthrough for nonexistent layers, allows easy configuration of
    # pipeline parallel stages
    h = self.tok_embeddings(tokens) if self.tok_embeddings else tokens

    for layer in self.layers.values():
        h = layer(h, self.freqs_cis)

    h = self.norm(h) if self.norm else h
    return self.output(h).float() if self.output else h
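A quick check at the top of forward (a sketch; it assumes tok_embeddings is an nn.Embedding) can confirm whether out-of-range token ids are what trips the embedding lookup:

# At the top of forward(); assumes tok_embeddings is an nn.Embedding.
if self.tok_embeddings is not None:
    vocab_size = self.tok_embeddings.num_embeddings
    max_id = tokens.max().item()
    if max_id >= vocab_size or tokens.min().item() < 0:
        raise ValueError(
            f"token id {max_id} outside [0, {vocab_size}); "
            "this is what triggers the device-side assert in tok_embeddings"
        )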
It seems that any tokenizer from Hugging Face must be initialized and run on a CPU device; running it directly on a GPU device might cause device-side assertion errors.
To potentially resolve this issue, I would suggest changing your code as below.
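As a rough sketch of that suggestion (the checkpoint name, max_length, and the surrounding inference code are placeholders, not the original poster’s code):

import torch
from transformers import AutoTokenizer

# Placeholders: the checkpoint name, max_length, and model call are illustrative only.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tokenize on the CPU; the tokenizer produces plain CPU tensors.
encoded = tokenizer(
    ["an example sentence"],
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

# Only the resulting tensors are moved to the GPU for the forward pass.
encoded = {k: v.to(device) for k, v in encoded.items()}
# with torch.no_grad():
#     logits = model(**encoded)[0]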