Random CUDA error: device-side assert triggered (once every week)

You are running into an indexing error and should isolate which line of code calls into this scatter/gather kernel.
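One common way to isolate the failing line is to make kernel launches synchronous, so the Python stack trace points at the op that actually triggered the assert rather than at a later, unrelated call. A minimal sketch, assuming you can edit the entry point of your script:

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized,
# so set it before importing torch (or export it in the shell instead).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and run the failing code only after this point
```

This slows training down, so it is only meant for debugging runs. Re-running the failing step on the CPU usually also gives a more readable out-of-bounds error message.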

Try to correlate these changes with any indexing, scatter, or gather operation, and verify that these ops are using valid indices.
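To verify the indices, you could add a temporary range check in front of each suspicious op. The sketch below is only an illustration: `check_index` is a hypothetical helper (not part of PyTorch), and the tensors are made-up data standing in for your real inputs.

```python
import torch

def check_index(index: torch.Tensor, size: int, name: str = "index") -> None:
    """Raise IndexError if any entry of `index` falls outside [0, size).

    Hypothetical debugging helper. gather/scatter do not wrap negative
    indices, so negative values are treated as invalid here.
    """
    if index.numel() == 0:
        return
    # int(...) forces a host/device sync on CUDA tensors, so remove the
    # check again once the bug is found.
    lo, hi = int(index.min()), int(index.max())
    if lo < 0 or hi >= size:
        raise IndexError(
            f"{name}: values span [{lo}, {hi}], valid range is [0, {size - 1}]"
        )

src = torch.arange(12.0).reshape(3, 4)
idx = torch.tensor([[0, 3], [1, 4], [2, 0]])  # 4 is out of range for dim 1

try:
    check_index(idx, src.size(1), "gather index")
    out = torch.gather(src, 1, idx)
except IndexError as err:
    print(err)
```

Because the check raises a regular Python exception before the kernel is launched, the traceback points directly at the offending op, and you can log the batch that produced the bad index.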

I don’t think the OOM is real; it is most likely just reported after running into the sticky assert, so I would ignore it for now and focus on the indexing error, which is real.