I’m running into a problem while fine-tuning the SegFormer model from Hugging Face in a Colab notebook. After a certain number of epochs, training consistently fails with the following error:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
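To act on the hint in that message inside Colab, the environment variable has to be set before CUDA is initialized. A minimal sketch, assuming it runs in the first cell after a runtime restart (that placement is an assumption about the notebook, not something from the traceback):

```python
import os

# Make CUDA kernel launches synchronous so the Python stack trace
# points at the call that actually failed.
# This must be set before the first CUDA call; the safest place is
# before `import torch`, in the first cell after restarting the runtime.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Running the failing batch on the CPU instead is another option, since the CPU path usually raises an ordinary Python exception that names the offending index.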
My training data consists of 640×640 RGB images, shape (640, 640, 3), and I’m using an image processor to preprocess them for the model. Despite trying various approaches, I have not been able to resolve the issue.
from transformers import AutoImageProcessor

checkpoint = "nvidia/mit-b0"
image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=False)
Could anyone offer insights into why this error might be occurring and how I could potentially troubleshoot it?
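One frequent cause of a device-side assert in segmentation training is a label value outside the range the loss expects, which would also explain why it only shows up once a particular batch is reached. A minimal sketch of a check over the annotation masks, assuming class IDs should lie in [0, num_labels) and that 255 is the ignore index (the helper `find_invalid_labels` and the sample mask are hypothetical, and the ignore value should be confirmed against the model config):

```python
import numpy as np

def find_invalid_labels(mask, num_labels, ignore_index=255):
    """Return the sorted label values in `mask` that the loss cannot index.

    A value is valid if it is a class ID in [0, num_labels) or equals
    `ignore_index` (assumed here to be 255).
    """
    values = np.unique(np.asarray(mask))
    valid = ((values >= 0) & (values < num_labels)) | (values == ignore_index)
    return values[~valid].tolist()

# Hypothetical mask for a 3-class problem: the value 7 is out of range.
mask = np.array([[0, 1, 2],
                 [2, 255, 7]])
print(find_invalid_labels(mask, num_labels=3))  # [7]
```

Running this over every mask in the dataset, with num_labels set to the value passed to the model, would show whether any annotation contains an ID the classifier head has no output for.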