Practical tips for "RuntimeError: CUDA error: device-side assert triggered"

There are several posts already about this error, and I’ve encountered it many times, but it’s still often difficult to solve.

Partly this is because the stack trace usually points to the wrong place. As pointed out elsewhere, the correct stack trace can be obtained by setting CUDA_LAUNCH_BLOCKING=1, but then I need to rerun everything, which could take several hours if the error only appears at, say, epoch 70.

A second reason why this is an awkward error is that pdb is not much use. When I break into execution, using either breakpoint() or python -m pdb, I can no longer access or create any CUDA tensors once the error has occurred, so I can’t find out where the offending index came from (the error is almost always caused by an out-of-bounds index somewhere).

Is there a way to get pdb to work here? More generally, are there any tips for working with this error? I’ve gotten quicker at solving it but it still sometimes costs me half a day.

Also, the error often occurs when using cross-entropy loss with an out-of-bounds target. Would it be worth including a simple Python assert inside F.cross_entropy (and similar functions) to check that the indices are valid before handing off to CUDA? Keeping the error within Python would make debugging much easier.

Yes, rerunning with CUDA_LAUNCH_BLOCKING=1 is the right approach. Since CUDA operations are executed asynchronously, the stack trace could otherwise point to the wrong line.
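CUDA_LAUNCH_BLOCKING=1 is usually set on the command line when launching the script. As a minimal sketch, it can also be set from inside the script itself. This assumes that, as far as I know, the variable has to be in place before CUDA is initialized, so the safest spot is before torch is imported at all:

```python
import os

# Assumption: this must run before the first CUDA call in the process;
# placing it above "import torch" is the conservative choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and build the model) only after this point
```

With blocking launches, each kernel finishes before Python continues, so the traceback should point at the call that actually triggered the assert, at the cost of slower training.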

The pdb behavior is expected, as the device-side assert corrupts the CUDA context. Executing any further CUDA operation could re-raise the same error or raise another one.

No, a built-in check in F.cross_entropy would synchronize your code and would cause a large performance hit.
In this case you could use a CPU-only run, which would give a better stack trace and error reporting, or add manual asserts if you cannot guarantee that your targets are inside the expected range.
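A manual check of this kind could look like the sketch below. The helper name check_class_targets is hypothetical (it is not part of PyTorch), and it assumes logits of shape (N, C, ...) with integer class-index targets and the default ignore_index of -100. Reading the min and max values forces a synchronization on CUDA tensors, so it is best enabled only while hunting a bug:

```python
import torch
import torch.nn.functional as F


def check_class_targets(logits, target, ignore_index=-100):
    """Raise a Python-side error if any class index in `target` is out of range."""
    num_classes = logits.shape[1]  # class dimension is dim 1 for (N, C, ...) logits
    valid = target[target != ignore_index]  # ignored positions may hold any value
    if valid.numel() == 0:
        return
    # Converting to Python ints reads tensor values, which synchronizes on CUDA.
    lo, hi = int(valid.min()), int(valid.max())
    if lo < 0 or hi >= num_classes:
        raise ValueError(
            f"target indices span [{lo}, {hi}] but logits have {num_classes} classes"
        )


logits = torch.randn(4, 3)           # 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 3])  # index 3 is out of range

try:
    check_class_targets(logits, target)
    loss = F.cross_entropy(logits, target)
except ValueError as err:
    message = str(err)
    print(message)
```

The same helper works unchanged on CPU tensors, so it can also be used when replaying a suspicious batch in a CPU-only run.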

Ah right, I see. I take it manual asserts will also cause a performance hit. Is the same true for any Python asserts that happen in between CUDA operations? I sometimes leave them in at various points during training, e.g. to ensure correct shapes or that certain values are in the expected range. Should I perhaps use them only when specifically checking something and remove them afterwards?

You would have to check whether a synchronization is needed. That is always the case if you need to read the actual values of a tensor, since the CPU has to wait (synchronize) until the GPU has finished the kernel execution before it can perform the check.
This section of the performance guide explains it in more detail with more examples.
E.g. each print statement would also need to synchronize the GPU (assuming you want to print a CUDA tensor), since the actual values are needed there as well.
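To illustrate the distinction, here is a rough sketch with a hypothetical helper check_batch and a hypothetical check_values switch. Shape asserts only touch metadata that already lives on the host, so they should not force a synchronization, whereas asserts on tensor values do and are therefore kept behind the switch:

```python
import torch


def check_batch(x, expected_features, check_values=False):
    # Shape information is stored on the host, so this assert does not
    # need to wait for pending GPU kernels.
    assert x.shape[-1] == expected_features, f"unexpected shape {tuple(x.shape)}"

    if check_values:
        # bool() needs the actual result, so for a CUDA tensor the CPU has to
        # wait until the GPU has finished computing it (a synchronization).
        assert bool(torch.isfinite(x).all()), "non-finite values in batch"


x = torch.zeros(2, 5)
check_batch(x, 5)                     # cheap, can stay in the training loop
check_batch(x, 5, check_values=True)  # enable only while debugging
```

Leaving the shape checks in permanently and switching the value checks on only when investigating a problem keeps the training loop free of extra synchronizations.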
