Random CUDA error: device-side assert triggered (once every week)

Hey everyone,

I’ve been using TorchServe in production for over a year, handling several million requests weekly without issues. Recently, it started crashing every few days.

I’ve added more details in a GitHub issue: CUDA out of Memory with low Memory Utilization (CUDA error: device-side assert triggered) · Issue #3114 · pytorch/serve · GitHub

To diagnose, I set CUDA_LAUNCH_BLOCKING=1 and encountered a “CUDA error: device-side assert triggered” and a “CUDA out of memory” error when transferring data to the GPU.
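
For reference, this is roughly how I set it inside the worker process (simplified; as far as I know the variable has to be set before CUDA is initialized, otherwise it has no effect):

```python
import os

# CUDA_LAUNCH_BLOCKING makes kernel launches synchronous, so the op that
# triggers the device-side assert shows up in the Python traceback.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the env var is in place
```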

Here are some details:

  • I log torch.cuda.max_memory_allocated() and torch.cuda.memory_allocated() (see the sketch after this list).
  • Typically, the models use about 6180 MiB out of the 23028 MiB available.
  • torch.cuda.max_memory_allocated() reports around 366 MB.
  • I suspected a memory leak, shape mismatches, or NaN values, but nothing stood out, and I’ve added a number of checks to rule these out.
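
The logging itself is nothing fancy, roughly this (the tag is just for illustration):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Both counters are reported in bytes; convert to MiB for readability.
    allocated = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"[{tag}] allocated={allocated:.1f} MiB, peak={peak:.1f} MiB")

log_gpu_memory("after inference")
```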

There are a few things that are a bit odd about this issue:

  • The server ran fine for over a year; I only made a few updates a few months back, and all of a sudden it started crashing frequently.
  • I thought some edge case was crashing the server, but it only crashes some of the running instances.
  • It happens randomly every 3-5 days, which is why I assumed a memory leak, but I can’t find any evidence of one.
  • I get a device-side assert triggered and a CUDA out of memory error, yet there seems to be plenty of available memory, and I check for NaN values and wrong shapes before placing data on the GPU (roughly as in the sketch after this list).
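
For context, the pre-transfer checks look roughly like this (simplified; the expected shape is just an example, not my actual model input):

```python
import torch

def validate_and_move(batch: torch.Tensor, expected_last_dim: int = 512) -> torch.Tensor:
    # Reject NaN/Inf values before the tensor ever reaches the GPU.
    if not torch.isfinite(batch).all():
        raise ValueError("batch contains NaN or Inf values")
    # Reject unexpected shapes (expected_last_dim is an illustrative value).
    if batch.ndim != 2 or batch.shape[-1] != expected_last_dim:
        raise ValueError(f"unexpected batch shape: {tuple(batch.shape)}")
    return batch.to("cuda", non_blocking=True)
```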

I’ve run out of ideas; any thoughts or feedback would be much appreciated.

Before the device-side assert error, I get 32 of these assertion messages, which I suspect overload the memory and cause the problem:

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [28,0,0], thread: [53,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [28,0,0], thread: [54,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [28,0,0], thread: [55,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.

You are running into an indexing error and should isolate which line of code calls into this scatter/gather kernel.

Try to correlate the changes you made a few months back with any indexing, scatter, or gather operation and verify that these ops use valid indices.
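
As a rough sketch (untested; adjust to your actual tensors), you can validate the index range on the host right before the call, which turns the device-side assert into a readable Python exception:

```python
import torch

def checked_gather(src: torch.Tensor, dim: int, index: torch.Tensor) -> torch.Tensor:
    # Check the index bounds on the host so an out-of-range value raises a
    # clear Python error instead of a sticky device-side assert.
    if index.numel() > 0:
        lo, hi = index.min().item(), index.max().item()
        if lo < 0 or hi >= src.size(dim):
            raise IndexError(
                f"index range [{lo}, {hi}] is out of bounds for dim {dim} "
                f"with size {src.size(dim)}"
            )
    return torch.gather(src, dim, index)
```

The same pattern applies to scatter_, index_select, etc. Once you know which op receives the bad indices, you can check why the recent changes produce them.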

I don’t think the OOM is real; it’s likely just reported after running into the sticky assert, so I would ignore it for now and focus on the indexing error, which is real.