Is there any way that I can "bypass" the device-side assert triggered error?

Hi. This might sound like a bit of a strange question, but I’m wondering if there’s any way that I can simply skip over samples that cause the device-side assert triggered error.

There are a few samples in my dataset that cause this error, and I'm wondering whether I have to track down and debug the error (or reformulate the data), or whether I can simply wrap the training step in a try-except-continue block to bypass it.

I’ve tried doing precisely that and my program still crashes with the error. Is it even possible to bypass it?

You shouldn't disable device-side assert errors, as they point to critical bugs in your code. Also note that once such an assert fires, the CUDA context is corrupted, so a Python try-except block cannot recover from it and your process will keep crashing.
E.g. you might run into an illegal memory access (PyTorch's device-side assert statements are there to trigger before the actual memory violation), which could corrupt your data in unpredictable ways.
If you did find a way to disable all asserts, your code might use an invalid index, and the (now unchecked) indexing kernel would write a value to that invalid memory location.
If other data (e.g. a forward activation) were stored there, it might be overwritten, which could even show up as random NaN values during training.
In short: don’t do it and fix your code :wink:
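Since the usual cause of this assert is an out-of-range target or index, the practical fix is to validate the data on the CPU before it ever reaches the GPU. A minimal sketch of that idea (the names `labels` and `num_classes` are assumptions for illustration, not part of any PyTorch API):

```python
# Hypothetical sketch: scan the labels on the CPU before training,
# instead of trying to catch a device-side assert (which leaves the
# CUDA context in an unrecoverable state).

def find_bad_samples(labels, num_classes):
    """Return indices of samples whose label would trip the
    out-of-bounds check inside e.g. a classification loss."""
    return [i for i, y in enumerate(labels)
            if not (0 <= y < num_classes)]

labels = [0, 2, 1, 5, -1, 3]   # toy labels; valid class range is [0, 4)
bad = find_bad_samples(labels, num_classes=4)
print(bad)  # → [3, 4]: the samples with label 5 and label -1
```

You could then drop or relabel the offending samples once, rather than skipping them at runtime. Running the script with `CUDA_LAUNCH_BLOCKING=1` also helps locate the exact operation that trips the assert, since kernel launches are otherwise asynchronous.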
