Hello everyone,
I'm running multiple models in sequence with different hyperparameter settings, and sometimes a device-side assert error occurs in the loss of one of the models. This prevents all subsequent models from running, because the device-side assert is somehow triggered in their runs as well, even though those models do not cause it.
I would like to catch the device-side assert and just continue training with my subsequent models. How would I achieve this?
Catching the error normally does not work.
I'm training on GPU.
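To make clear what I mean by catching the error normally, here is a minimal sketch of the pattern. It is a simplified stand-in and not my actual code: `train_one_model`, `run_all`, and the `lr` setting are hypothetical names, and the failure is simulated with a plain `RuntimeError` whose message text is only my assumption of what the GPU error looks like.

```python
# Hypothetical stand-in for the real training function; the failure is simulated.
def train_one_model(config):
    if config["lr"] > 1.0:
        # In the real runs the error comes from the GPU, inside the loss computation.
        raise RuntimeError("CUDA error: device-side assert triggered")
    return {"lr": config["lr"], "status": "ok"}


def run_all(configs):
    """Train every configuration in sequence, recording failures instead of stopping."""
    results = []
    for config in configs:
        try:
            results.append(train_one_model(config))
        except RuntimeError as err:
            # Record the failure and move on to the next configuration.
            results.append({"lr": config["lr"], "status": f"failed: {err}"})
    return results


if __name__ == "__main__":
    configs = [{"lr": 0.01}, {"lr": 10.0}, {"lr": 0.1}]
    for result in run_all(configs):
        print(result)
```

With this simulated error the loop carries on to the third configuration as expected. In my real GPU runs it does not behave that way: once the assert has fired in one model, every later model in the same process fails with the same error.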
Any help is appreciated.
Thank you
Hilmar