CUDA asynchronous error

Hi! I want to run inference of a trained model on the GPU. A CUDA asynchronous error is triggered if I feed the images one by one. However, if I input all the images simultaneously (setting batch_size to the number of images), the inference works fine. Does anyone know the reason? Thank you!

Here is my code:

class myDataset(Dataset):
    def __init__(self, images, transform):
        self.images = images
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = cv.resize(self.images[idx], (256, 256))
        image = self.transform(image)
        
        return image

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

If I set batch_size = 1:

imgLoader = DataLoader(myDataset(images, transform), shuffle=False, batch_size=1)

masks = []
for img in imgLoader:
    img_cuda = img.to(device)
    print('---------\n')
    masks.append(unet(img_cuda))
    print('---------\n')

This outputs the following error:

---------

---------

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_2612/2454743386.py in <module>
      1 masks = []
      2 for img in imgLoader:
----> 3     img_cuda = img.to(device)
      4     print('---------\n')
      5     masks.append(unet(img_cuda))

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
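
The traceback suggests passing CUDA_LAUNCH_BLOCKING=1 to get a more accurate stack trace. In case it helps, this is how I would set it from the notebook (just a sketch; the variable has to be set before torch initializes CUDA, so ideally before importing torch):

# sketch: force synchronous kernel launches so the failing op shows up in the stack trace
# must be set before `import torch` / any CUDA initialization
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch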

If I set batch_size = 21, which is the total number of my images:

imgLoader = DataLoader(myDataset(images, transform), shuffle=False, batch_size=21)

#masks = []
for img in imgLoader:
    img_cuda = img.to(device)
    print('---------\n')
    masks = unet(img_cuda)
    print('---------\n')

Everything works fine. Output is:

---------

---------

Here are the versions of PyTorch and the GPU driver:

torch.__version__
>> '1.12.0'

torch.version.cuda
>> '11.3'

!nvidia-smi
>>
Sun Jul  3 19:52:09 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    73W / 149W |  11275MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2666      C   ...vs/pytorch_env/bin/python    11270MiB |
+-----------------------------------------------------------------------------+

So a memory violation is triggered only for the small input shape.
Which PyTorch release are you using? If it’s an older one, could you update to the latest stable or nightly release (with the latest CUDA runtime)?

If you are still seeing the error, could you post a minimal executable code snippet as well as the output of python -m torch.utils.collect_env?

python -m torch.utils.collect_env
>>
Collecting environment information...
PyTorch version: 1.12.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.8.2.el7.x86_64-x86_64-with-centos-7.9.2009-Core
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K80
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==1.12.0
[pip3] torchvision==0.13.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py37h7f8727e_0  
[conda] mkl_fft                   1.3.1            py37hd3c417c_0  
[conda] mkl_random                1.2.2            py37h51133e4_0  
[conda] numpy                     1.21.5           py37h6c91a56_3  
[conda] numpy-base                1.21.5           py37ha15fc14_3  
[conda] pytorch                   1.12.0          py3.7_cuda11.3_cudnn8.3.2_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchvision               0.13.0               py37_cu113    pytorch

Thanks for the quick reply! PyTorch should now be updated to the latest version.

Thanks for the information. Could you post a minimal, executable code snippet reproducing the issue?

Sure! But I haven’t done that before. What should I provide? Do I need to post all the code of the model?

The model definition with random tensors passed to it might be enough, if this reproduces the error.
It's important that the code can be copy-pasted and run as-is, and the error should of course be raised by this code snippet in your setup.
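
For example, a skeleton along these lines would be enough (the model below is only a placeholder for your own definition, and the input shape is an assumption):

import torch
import torch.nn as nn

# placeholder model -- in the real snippet this would be your model definition
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)
model.eval()

x = torch.randn(1, 3, 256, 256, device=device)  # random tensor instead of real images
out = model(x)
print(out.shape)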

Hello! Sorry for the late reply. Here is the code snippet:

import torch 
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

import cv2 as cv
import numpy as np

import requests

# define PyTorch dataset
class myDataset(Dataset):
    def __init__(self, images, transform):
        self.images = images
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = cv.resize(self.images[idx], (256, 256))
        image = self.transform(image)
        
        return image


# download unet model
r = requests.get('https://raw.githubusercontent.com/supervisely/supervisely/master/plugins/nn/unet_v2/src/unet.py')
open('unet_model.py', 'wb').write(r.content)

# import model
from unet_model import construct_unet

# construct 21 random 400x400x3 images
images = [np.random.randint(0, 255, (400, 400, 3), dtype=np.uint8) for _ in range(21)]

# move the model to the GPU (if available)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
unet = construct_unet(5)
unet.to(device)
unet.eval();

# if batch_size is 21, i.e., all images are fed at once, no error happens;
# otherwise the asynchronous error is raised
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

imgLoader = DataLoader(myDataset(images, transform), shuffle=False, batch_size=1)

# inference
masks = []
for img in imgLoader:
    img_cuda = img.to(device)
    print('---------\n')
    masks.append(unet(img_cuda))
    print('---------\n')

OK. I don’t think the error comes from the model. I used a PyTorch resnet18 instead and ran into the same problem.

Here is the code:

import torch.nn as nn
from torchvision import models

# resnet18 with the final layer replaced to output 5 classes
resnet = models.resnet18(pretrained=True)
num_ftrs = resnet.fc.in_features
resnet.fc = nn.Linear(num_ftrs, 5)

resnet.to(device)
resnet.eval();

masks = []
for img in imgLoader:
    img_cuda = img.to(device)
    print('---------\n')
    masks.append(resnet(img_cuda))
    print('---------\n')

Nothing else is changed and the same error happens.

A few more updates:

I ran the script on the CPU and it runs without errors.

Unfortunately, I cannot reproduce the issue on any newer hardware and don’t have access to a K80.
Maybe you could try to install the PyTorch binary with CUDA 10.2 and rerun it.

It’s OK. I ran the code on a GeForce GPU instead, and it works.

BTW, I forgot to add torch.inference_mode() in my code. If I don’t add it, a different CUDA error (“unknown error”) is triggered.
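
For reference, the loop now looks roughly like this (the same loop as above, just wrapped in the context manager):

# same inference loop as before, wrapped in inference_mode so no autograd
# state is created or stored for the outputs
masks = []
with torch.inference_mode():
    for img in imgLoader:
        img_cuda = img.to(device)
        masks.append(unet(img_cuda))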

Did you run the original code without “with torch.inference_mode()” and not run into any error? I get errors unless I add “with torch.inference_mode()”.