Hi all,
I am currently working on deploying a uvicorn model inference API to Kubernetes, and I am running into a peculiar problem. According to torch.cuda.is_available(), CUDA is ready to be used (it returns True), but as soon as I try to run inference with the model, I get an internal server error telling me that there was an illegal memory access.
For context, I am trying to find license plates on grayscale images.
I cannot seem to get CUDA_LAUNCH_BLOCKING=1 working in the container, no matter whether I set it through the container's environment variables or through os.environ, so all I get is the bare “illegal memory access” error.
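For completeness, this is roughly how I tried to set it from Python; my understanding is that the variable has to be set before the CUDA context is created, i.e. before the first CUDA call (the placement shown here is just what I attempted, not a confirmed fix):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # imported only after the variable is set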
Here are a few things I have already tried:
- Different CUDA versions (11.3 and 11.6 specifically)
- Garbage collection and clearing the cache
I have also tried transforming the numpy array to a tensor with the torch.from_numpy function, but that gave me a ValueError: not enough values to unpack (expected 4, got 2) error.
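My assumption (untested) is that this unpack error simply means the grayscale image is a 2-D (H, W) array, while the raw model internally unpacks a 4-D (N, C, H, W) batch. Something like the sketch below would at least give the tensor the expected rank, although the model presumably still expects 3 channels, so treat this as a guess about where the error comes from rather than a fix:

# Guess: add the missing channel and batch dimensions to the 2-D grayscale array
img_tensor = (
    torch.from_numpy(img_equalized)
    .float()
    .unsqueeze(0)   # channel dimension -> (1, H, W)
    .unsqueeze(0)   # batch dimension   -> (1, 1, H, W)
    .to(device)
)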
This runs in a Kubernetes pod where I have deployed a Docker image from a private container registry. Here is the output of python -m torch.utils.collect_env:
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.27.2
Libc version: glibc-2.31
Python version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-1112-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K80
Nvidia driver version: 470.82.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==1.12.1+cu116
[pip3] torchaudio==0.12.1+cu116
[pip3] torchvision==0.13.1+cu116
[conda] numpy 1.25.2 pypi_0 pypi
[conda] torch 1.12.1+cu116 pypi_0 pypi
[conda] torchaudio 0.12.1+cu116 pypi_0 pypi
[conda] torchvision 0.13.1+cu116 pypi_0 pypi
Here is a minimal version of the main script, which throws the same error:
import torch
import base64
import cv2
import numpy as np

# Load the custom YOLOv5 weights via torch.hub
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best_1024.pt', force_reload=False)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

if torch.cuda.is_available():
    torch.cuda.empty_cache()


def find_plate(image_content):
    img_equalized = read_request_image(image_content)
    img_tensor = torch.from_numpy(img_equalized).float().to(device)  # throws ValueError if used as input for the model: expected 4 values, got 2
    results = model(img_equalized, size=1024)
    return {
        'results': str(results),
        'device': device,
        'tensor': img_tensor.device
    }


def read_request_image(image_content):
    im_bytes = base64.b64decode(image_content)
    im_arr = np.frombuffer(im_bytes, dtype=np.uint8)  # im_arr is a one-dimensional numpy array
    img = cv2.imdecode(im_arr, flags=cv2.IMREAD_GRAYSCALE)
    return img


if __name__ == '__main__':
    plate_base64 = '<base64 of image containing license plate>'
    result = find_plate(plate_base64)
    print(result)
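For context, in the real deployment find_plate is exposed through a FastAPI app served by uvicorn, roughly like this (simplified; the endpoint and field names here are placeholders, not the actual code):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PlateRequest(BaseModel):
    image_content: str  # base64-encoded grayscale image

@app.post("/find_plate")
def find_plate_endpoint(request: PlateRequest):
    # stringify values so torch.device objects serialize cleanly to JSON
    return {k: str(v) for k, v in find_plate(request.image_content).items()}

# served with: uvicorn main:app --host 0.0.0.0 --port 8000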
If I turn off the GPU for the container, the main script works without issues, but then everything obviously runs on the CPU and is much slower. What can I do to resolve this issue?
Many thanks in advance; this is the last thing holding up our deployment, so any help would be very much appreciated.