Hey everyone,
I’ve been using TorchServe in production for over a year, handling several million requests weekly without issues. Recently, it started crashing every few days.
I’ve added more details in a GitHub issue: CUDA out of Memory with low Memory Utilization (CUDA error: device-side assert triggered) · Issue #3114 · pytorch/serve · GitHub
To diagnose, I set CUDA_LAUNCH_BLOCKING=1 and encountered a “CUDA error: device-side assert triggered” and a “CUDA out of memory” error when transferring data to the GPU.
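For reference, this is roughly how the flag is set. It only takes effect if it is in the environment before the worker process makes its first CUDA call, so it is set before `torch` does any GPU work. This is a minimal sketch, not my exact production setup:

```python
# Minimal sketch: CUDA_LAUNCH_BLOCKING must be in the environment before
# the first CUDA call in the worker process, otherwise it has no effect.
import os
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch  # imported after the flag so kernel launches run synchronously

# With the flag on, the stack trace points at the kernel that actually
# failed instead of at a later, unrelated call.
```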
Here are some details:
- I log torch.cuda.max_memory_allocated() and torch.cuda.memory_allocated().
- Typically, the models use about 6180 MiB out of 23028 MiB available.
- torch.cuda.max_memory_allocated() reports around 366 MB.
- I suspected a memory leak, shape mismatches, or NaN values, and added a number of checks for them, but nothing stood out (a rough sketch of the logging and checks follows this list).
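A simplified sketch of what the memory logging and input checks look like. The expected shape, logger name, and function name are placeholders, not the exact production code:

```python
import logging

import torch

logger = logging.getLogger(__name__)

EXPECTED_FEATURES = 128  # placeholder for the real expected input width


def to_gpu_with_checks(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Validate a CPU batch, move it to the GPU, and log memory stats."""
    # Guard against the shape mismatches / NaN values I suspected.
    if batch.ndim != 2 or batch.shape[1] != EXPECTED_FEATURES:
        raise ValueError(f"unexpected input shape {tuple(batch.shape)}")
    if torch.isnan(batch).any() or torch.isinf(batch).any():
        raise ValueError("batch contains NaN/Inf values")

    gpu_batch = batch.to(device, non_blocking=True)

    # These are the numbers quoted above (~366 MB peak vs. ~23 GiB total).
    logger.info(
        "cuda mem: allocated=%.1f MiB, max_allocated=%.1f MiB",
        torch.cuda.memory_allocated(device) / 2**20,
        torch.cuda.max_memory_allocated(device) / 2**20,
    )
    return gpu_batch
```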
There are a few things that are a bit odd about this issue:
- The server had been running fine for over a year; I only made a few updates a few months back, and all of a sudden it started crashing frequently
- I initially thought some edge case was crashing the server, but only some of the running instances crash
- It happens randomly every 3-5 days, which is why I assumed a memory leak, but I can’t find any evidence of one
- I get both a device-side assert and a CUDA out-of-memory error, yet there seems to be plenty of free memory, and I already check for NaN values and wrong shapes before moving anything to the GPU.
I’ve run out of ideas; any thoughts or feedback would be much appreciated.