CUDA is available, illegal memory access on cuda_synchronize

Hi all,

I am currently working on deploying a uvicorn model inference API to Kubernetes, and I am running into a peculiar problem. torch.cuda.is_available() returns True, so CUDA appears ready to use; however, as soon as I try to run inference with the model, I get an internal server error reporting an illegal memory access.

For context, I am trying to find license plates on grayscale images.

I cannot seem to get CUDA_LAUNCH_BLOCKING=1 working in the container, whether I set it through the container's environment variables or through os.environ, so the only error I get is “illegal memory access”.
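One possible reason the variable seems to have no effect is timing: CUDA_LAUNCH_BLOCKING is read when the CUDA runtime initializes, so assigning it through os.environ after the first CUDA call changes nothing. Below is a minimal sketch, assuming the variable is set at the very top of the entry module, before torch is imported (the conservative choice). The helper name enable_cuda_launch_blocking is hypothetical and not part of any library.

```python
import os
import sys


def enable_cuda_launch_blocking() -> bool:
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns True if torch has not been imported yet, i.e. the setting is
    certain to be in place before any CUDA initialization triggered by torch.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return "torch" not in sys.modules


# Call this first in the entry module, before `import torch`.
set_before_torch = enable_cuda_launch_blocking()
if not set_before_torch:
    print("warning: torch was already imported; the setting may be ignored")
```

If the helper reports that torch was already imported, moving the call earlier (or setting the variable in the container spec rather than in Python) is probably the safer route.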

Here are a few things I have already tried:

  • Different CUDA versions (11.3 and 11.6 specifically)
  • Garbage collection and clearing the cache

I have also tried converting the numpy array to a tensor with the torch.from_numpy function and passing that to the model, but that raised ValueError: not enough values to unpack (expected 4, got 2).
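The ValueError is consistent with shape rather than with CUDA: a grayscale image decoded by OpenCV is a 2-D array (height, width), while a detection model given a raw tensor typically expects four dimensions (batch, channels, height, width). The sketch below shows one way to build that layout with NumPy. It assumes the model wants three channels and float values in [0, 1], which is an assumption about the model and not something stated in this thread; gray_to_nchw is a hypothetical helper name.

```python
import numpy as np


def gray_to_nchw(img: np.ndarray) -> np.ndarray:
    """Convert a 2-D uint8 grayscale image (H, W) to float32 (1, 3, H, W)."""
    if img.ndim != 2:
        raise ValueError(f"expected a 2-D grayscale image, got shape {img.shape}")
    rgb = np.repeat(img[:, :, None], 3, axis=2)   # (H, W, 3): copy gray into 3 channels
    chw = rgb.transpose(2, 0, 1)                  # (3, H, W): channels first
    batch = chw[None].astype(np.float32) / 255.0  # (1, 3, H, W), scaled to [0, 1]
    return np.ascontiguousarray(batch)


dummy = np.zeros((4, 6), dtype=np.uint8)
print(gray_to_nchw(dummy).shape)  # (1, 3, 4, 6)
```

The result could then be passed through torch.from_numpy(...).to(device). Passing the NumPy image straight to the model, as the script further down does, avoids this step entirely.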

This runs in a Kubernetes container deployed from a Docker image in a private container registry. Here is the output of torch.utils.collect_env:

PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.27.2
Libc version: glibc-2.31

Python version: 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-1112-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K80
Nvidia driver version: 470.82.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==1.12.1+cu116
[pip3] torchaudio==0.12.1+cu116
[pip3] torchvision==0.13.1+cu116
[conda] numpy                     1.25.2                   pypi_0    pypi
[conda] torch                     1.12.1+cu116             pypi_0    pypi
[conda] torchaudio                0.12.1+cu116             pypi_0    pypi
[conda] torchvision               0.13.1+cu116             pypi_0    pypi

Here is a minimal version of the main script, which throws the same error:

import torch
import base64
import cv2
import numpy as np

model = torch.hub.load('ultralytics/yolov5', 'custom', path='', force_reload=False)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def find_plate(image_content):
    img_equalized = read_request_image(image_content)
    img_tensor = torch.from_numpy(img_equalized).float().to(device) # throws ValueError if used as input for the model, expected 4 values got 2
    results = model(img_equalized, size=1024)
    return {
        'results': str(results),
        'device': device,
        'tensor': img_tensor.device
    }

def read_request_image(image_content):
    im_bytes = base64.b64decode(image_content)
    im_arr = np.frombuffer(im_bytes, dtype=np.uint8)  # im_arr is one-dim Numpy array
    img = cv2.imdecode(im_arr, flags=cv2.IMREAD_GRAYSCALE)
    return img

if __name__ == '__main__':
    plate_base64 = '<base64 of image containing license plate>'
    result = find_plate(plate_base64)

If I disable the GPU for the container, the main script works without issues, but it then runs on the CPU and is much slower. What can I do to resolve this issue?

Many thanks in advance, this is the last thing holding up our deployment so help would be very much appreciated.

Update to the latest stable or nightly release and check whether you still see the same error. If so, rerun your script via compute-sanitizer python args to narrow down which operation fails. If that doesn’t help, post a minimal, executable code snippet that reproduces the issue.

Turns out updating to PyTorch 2.0.1 (stable) already fixed it. The Kubernetes VMs were reporting CUDA 11.4, so I had been sticking to PyTorch builds for CUDA 11.3 or 11.6 to stay close to that version, but this turned out not to be necessary.

Thanks a lot!
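For anyone comparing versions in a similar setup, a small diagnostic like the following may help to see which PyTorch build is installed, which CUDA version it was built against, and what the GPU reports. It is only a sketch: describe_torch_cuda is a hypothetical helper, and it returns a placeholder when torch is not installed.

```python
import importlib.util


def describe_torch_cuda() -> dict:
    """Collect basic PyTorch/CUDA build and device information."""
    if importlib.util.find_spec("torch") is None:
        return {"torch": None}  # torch is not installed in this environment

    import torch

    info = {
        "torch": torch.__version__,             # e.g. the installed wheel version
        "built_with_cuda": torch.version.cuda,  # CUDA version the wheel was built against
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
        info["compiled_archs"] = torch.cuda.get_arch_list()
    return info


for key, value in describe_torch_cuda().items():
    print(f"{key}: {value}")
```

Comparing the device's compute capability with the architectures the build was compiled for is one quick way to check whether a given wheel targets the GPU at all.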