RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions

I ran the kohya LoRA training code and hit a CUDA index-out-of-bounds error. Can anyone help me with this issue? Thanks.

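Rerun the code with blocking launches and check the stack trace to isolate the failing indexing operation. CUDA kernels run asynchronously, so without blocking launches the reported stack trace often points at a later, unrelated op. Once you know which op is failing, check the min/max values of the index tensor as well as the shape of the tensor being indexed.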
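Here is a minimal sketch of the min/max check, assuming you have already found the failing op and can call a helper right before it. `check_index_bounds` is a hypothetical debugging helper, not part of PyTorch or kohya. It has to run before the failing op, because once the device-side assert fires, the CUDA context is broken and later CUDA calls (including `.min()`) fail too.

```python
import torch


def check_index_bounds(index: torch.Tensor, source: torch.Tensor, dim: int = 0) -> None:
    """Hypothetical debugging helper: raise a readable IndexError if `index`
    would fall outside dimension `dim` of `source`."""
    if index.numel() == 0:
        return  # an empty index cannot be out of bounds
    size = source.shape[dim]
    lo = int(index.min())  # int() copies the scalar to the host
    hi = int(index.max())
    # Strict check: only 0 <= index < size is accepted. Ops such as embedding
    # lookups reject negative indices; plain tensor slicing would accept them.
    if lo < 0 or hi >= size:
        raise IndexError(
            f"index range [{lo}, {hi}] is invalid for dim {dim} of shape "
            f"{tuple(source.shape)} (expected 0 <= index < {size})"
        )


if __name__ == "__main__":
    weight = torch.randn(10, 4)      # e.g. a table with 10 rows
    good = torch.tensor([0, 3, 9])
    bad = torch.tensor([0, 3, 10])   # 10 is one past the last valid row

    check_index_bounds(good, weight)  # passes silently
    try:
        check_index_bounds(bad, weight)
    except IndexError as e:
        print(e)
        # index range [0, 10] is invalid for dim 0 of shape (10, 4) (expected 0 <= index < 10)
```

The same check runs fine on CPU tensors, so you can also move the suspect inputs to CPU and reproduce the failure there to get a readable error.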

Thank you. After I reran the code, the training process eventually succeeded. By the way, how do I check the stack trace and isolate the failing indexing operation? I have no idea how to break at the point where this issue starts. I'm also curious whether the latent bucketing step that runs before LoRA training could cause this problem.

Run the code via `CUDA_LAUNCH_BLOCKING=1 python <your usual script and arguments>` and check the stack trace shown in your terminal to see which operation failed.
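If changing the launch command is awkward (for example when training is started from a GUI or a notebook), the same variable can be set from inside Python. This is a sketch that assumes you can edit the entry script. The variable only takes effect if it is set before CUDA is initialised, so in a notebook you would need to restart the kernel first. `main()` is a placeholder for the original training code. Blocking launches typically slow training down, so use this for debugging runs only.

```python
import os

# Make every CUDA kernel launch synchronous so the Python stack trace points
# at the op that actually failed. This must be set before CUDA is initialised,
# so it goes at the very top of the entry script, before torch is imported.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the variable on purpose


def main() -> None:
    # Placeholder: the original training code would go here.
    print("CUDA available:", torch.cuda.is_available())


if __name__ == "__main__":
    main()
```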

Hi, the same issue happens when enabling flash_attention in MBart. Any help here?