Catch device-side assert

Hello everyone

I'm running multiple models in sequence with different hyperparameter settings, and sometimes a device-side assert is triggered in the loss of one of the models. This prevents all subsequent models from running, because the same device-side assert error is also raised in their runs, even though those models did not cause it.
I would like to catch the device-side assert and just continue training with my subsequent models. How would I achieve this?
Catching the error normally does not work.
I'm training on a GPU.

Any help is appreciated.

Thank you

Hilmar

That’s not possible, since a device assert could corrupt the CUDA context. Every subsequent CUDA operation could re-raise the same error, raise a new one, or result in undefined behavior (UB).
Check what’s causing the device assert and fix it instead.
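
If you still need the remaining runs to proceed while you track down the cause, one possible workaround (not a way to catch the assert in-process) is to launch each run in its own Python process, since a CUDA context belongs to the process that created it. The sketch below is only an illustration under assumptions: `run_isolated` and `sweep` are hypothetical helper names, and the command you pass (for example a `train.py` script with its flags) is assumed to exist in your project. It also sets `CUDA_LAUNCH_BLOCKING=1` in the child process, which makes kernel launches synchronous so the stack trace is more likely to point at the operation that actually failed.

```python
import os
import subprocess
import sys


def run_isolated(cmd, timeout=None):
    """Run one job in a fresh Python process (hypothetical helper).

    `cmd` is the argument list passed to the interpreter, e.g.
    ["train.py", "--lr", "0.1"]; the script name is an assumption.
    Returns (ok, stderr_tail).
    """
    # Synchronous kernel launches make the failing op easier to locate.
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    proc = subprocess.run(
        [sys.executable, *cmd],
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # A non-zero (or signal) exit code means this run failed.
    return proc.returncode == 0, proc.stderr[-2000:]


def sweep(configs):
    """Run every config in its own process; a failed run only affects itself."""
    results = {}
    for name, cmd in configs.items():
        ok, err = run_isolated(cmd)
        results[name] = ok
        if not ok:
            print(f"[{name}] failed:\n{err}")
    return results
```

A call such as `sweep({"lr_0.1": ["train.py", "--lr", "0.1"], "lr_0.01": ["train.py", "--lr", "0.01"]})` would then run each setting separately and report which ones failed. The failing configuration still has to be fixed; a frequent culprit in loss functions is a target index outside the valid class range, and running that one configuration on the CPU usually gives a clearer error message.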