Issue with Fine-Tuning SegFormer Model: CUDA Error

Scbas_scias · March 1, 2024, 10:32am

I’m running into a problem while fine-tuning the SegFormer model from Hugging Face in a Colab notebook. After a certain number of epochs, training consistently fails with the following error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

My training data consists of images sized at (640, 640, 3), and I’m using an image processor to align them with the model. Despite trying various approaches, I’m still unable to resolve this issue.

from transformers import AutoImageProcessor

checkpoint = "nvidia/mit-b0"
image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=False)

Could anyone offer insights into why this error might be occurring and how I could potentially troubleshoot it?

ptrblck · March 2, 2024, 6:00pm

Rerun your script with CUDA_LAUNCH_BLOCKING=1 python script.py args and check the stacktrace to isolate the failing method. Often these kinds of issues are raised in an embedding layer receiving inputs containing values out of the expected range ([0, num_embeddings-1]).
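
As a rough sketch of that range check: the snippet below sets the environment variable from Python (useful in a notebook, where there is no command line; it has to run before CUDA is initialized) and then scans a flattened list of indices for values outside [0, num_classes-1]. The helper name find_out_of_range, the sample mask, the class count of 3, and the use of 255 as an ignore value are all assumptions for illustration, not taken from your code. For a segmentation model the same check would apply to the label masks rather than to embedding inputs.

```python
import os

# Must be set before the first CUDA call (ideally before importing torch),
# otherwise it has no effect on the current process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_out_of_range(values, num_classes, ignore_index=None):
    """Return the sorted distinct values that fall outside [0, num_classes - 1].

    `values` is any flat iterable of ints, e.g. labels.flatten().tolist()
    for a torch tensor. `ignore_index` (hypothetical here) is skipped.
    """
    bad = set()
    for v in values:
        if ignore_index is not None and v == ignore_index:
            continue
        if v < 0 or v >= num_classes:
            bad.add(v)
    return sorted(bad)


# Hypothetical flattened label mask for a 3-class model; 255 marks ignored pixels.
mask = [0, 1, 2, 2, 255, 3, 1]
bad_values = find_out_of_range(mask, num_classes=3, ignore_index=255)
print(bad_values)  # any value listed here would trigger the device-side assert
```

If this reports anything for your batches, either the labels need remapping or the model was created with too few classes. Running a single failing batch on the CPU is another way to get a readable error message.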