CUDA error: unknown error. CUDA kernel errors might be asynchronously reported at some other API call

Hi all,

I am a beginner with PyTorch and CV. I encountered a problem when trying to use mmaction2 to extract features from video clips. Following the tutorial from here, I tried to run a single-video test with the following command:

python3 tools/misc/clip_feature_extraction.py \
configs/recognition/i3d/i3d_r50_video_32x2x1_100e_kinetics400_rgb.py \
pretrained/i3d_r50_video_32x2x1_100e_kinetics400_rgb_20200826-e31c6f52.pth \
--video-list examples/inputs/video_list_single.txt \
--video-root examples/inputs/video \
--out examples/outputs/examples_feature.pkl

However, I got a RuntimeError: CUDA error: unknown error.

load checkpoint from local path: pretrained/i3d_r50_video_32x2x1_100e_kinetics400_rgb_20200826-e31c6f52.pth
[                                                  ] 0/1, elapsed: 0s, ETA:Traceback (most recent call last):
  File "tools/misc/clip_feature_extraction.py", line 229, in <module>
    main()
  File "tools/misc/clip_feature_extraction.py", line 217, in main
    outputs = inference_pytorch(args, cfg, distributed, data_loader)
  File "tools/misc/clip_feature_extraction.py", line 118, in inference_pytorch
    outputs = single_gpu_test(model, data_loader)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/engine/test.py", line 33, in single_gpu_test
    result = model(return_loss=False, **data)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/recognizers/base.py", line 264, in forward
    return self.forward_test(imgs, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/recognizers/recognizer3d.py", line 99, in forward_test
    return self._do_test(imgs).cpu().numpy()
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/recognizers/recognizer3d.py", line 63, in _do_test
    feat = self.extract_feat(imgs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/recognizers/base.py", line 163, in extract_feat
    x = self.backbone(imgs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/backbones/resnet3d.py", line 854, in forward
    x = res_layer(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/backbones/resnet3d.py", line 318, in forward
    out = _inner_forward(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/backbones/resnet3d.py", line 305, in _inner_forward
    out = self.conv1(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/cnn/bricks/conv_module.py", line 201, in forward
    x = self.conv(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/cnn/bricks/wrappers.py", line 80, in forward
    return super().forward(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 590, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 586, in _conv_forward
    input, weight, bias, self.stride, self.padding, self.dilation, self.groups
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

If I set CUDA_LAUNCH_BLOCKING=1, i.e., run CUDA_LAUNCH_BLOCKING=1 python3 ..., nothing more is shown.

I am not sure what causes the error, but I guess it might be a CUDA or PyTorch setup problem, since the code works properly on the other machine. FYI, I list the environments of the two machines below.

           Device 1 (has error)                            Device 2 (no error)
Platform   WSL2, Ubuntu 20.04.3                            WSL2, Ubuntu 20.04.3
GPU        GeForce GTX 1080 Ti, Driver=510.06, CUDA=11.6   GeForce RTX 2060, Driver=510.06, CUDA=11.6
PyTorch    pytorch=1.10.1, py=3.7, cuda=11.3.1             pytorch=1.10.1, py=3.7, cuda=11.3.1

My question is: what causes the error, and how can I fix it? Thanks very much.

Are you able to build and run any CUDA examples in the first setup at all, or are all CUDA applications crashing?


I ran into this problem as well. I have run with CUDA normally many times before; this time it seems to be because I didn't terminate other processes in the PyCharm consoles. That is what I want to ask: how do I correctly terminate a process in the PyCharm console when running PyTorch?

I got the same error, and it happens with all CUDA applications.

This would point to a general issue in your setup, so try to reinstall the drivers and make sure CUDA applications can work before rerunning the PyTorch workload.
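
As a quick sanity check, something like this minimal sketch (not from the original reply; it just exercises a simple kernel) should run cleanly on a healthy setup:

import torch

# Report the PyTorch build and the CUDA runtime it was compiled against
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())        # should print True
print(torch.cuda.get_device_name(0))    # should print the GPU name

# Launch a simple kernel and synchronize so any CUDA error surfaces here
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print(y.sum().item())

If even this raises "CUDA error: unknown error", the problem lies in the driver/CUDA setup rather than in mmaction2 or the model.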

I am actually training a network and everything works just fine for the first ~25 epochs; then, at some random epoch after 25, this error comes up. I don't think there is anything special about the epoch it appears at, so I am confused!

I don't fully understand your issue then, as you've mentioned the same error is seen in all CUDA applications. Could you describe the issue in more detail (what exactly is working, and what is failing when), and check for any Xid messages in dmesg?

I came across the same error message. In my case, I used a Resize transform, which led to bigger images and a higher memory requirement, hence out of memory. When I tried a smaller batch size, it worked. Maybe there should be a more informative error message?

I get the expected OOM error if Resize fails to allocate the GPU memory:

import torch
import torchvision.transforms as transforms

x = torch.randn(3, 224, 224, device="cuda")
out = transforms.Resize(1000000)(x)

Output:

...
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3952, in interpolate
    return torch._C._nn.upsample_bilinear2d(input, output_size, align_corners, scale_factors)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11175.87 GiB
...

Which error message did you receive?

My resizing was not that big.
The resizing was (32, 32) -> (224, 224) in the width and height dimensions.
My error is the same as the OP's error:

RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Batch size = 8.
nvidia-smi output, if you want (not captured during the training process):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 473.34       Driver Version: 473.34       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A200... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P0    14W /  N/A |    106MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In case you are not using a current nightly release, you might need to update to it, as the last stable release 1.13 had some issues where valid device asserts were not properly raised.

The error message “CUDA error: unknown error” suggests that an unknown error has occurred during the execution of a CUDA kernel. This error message is a generic error message that can occur for various reasons, such as out-of-memory errors, invalid memory accesses, or programming errors.

The additional message “CUDA kernel errors might be asynchronously reported at some other API call” means that the error may have occurred during the execution of a previous CUDA kernel and may have been reported at a later API call.
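
For illustration, here is a minimal sketch (assuming a small 3D conv as a stand-in for the failing layer in the traceback above) of how to force synchronous reporting from Python:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call in the process

import torch
import torch.nn as nn

# Stand-in for the conv that fails in the traceback above
conv = nn.Conv3d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 3, 8, 56, 56, device="cuda")
out = conv(x)
torch.cuda.synchronize()  # any pending asynchronous CUDA error is raised at this point
print(out.shape)

With launch blocking enabled, the traceback points at the kernel that actually failed instead of a later, unrelated call.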

To diagnose and fix the error, you can try the following steps:

Check your code for any memory access violations or programming errors. These can lead to unknown errors during kernel execution.

Make sure that you have enough memory available on your GPU for the kernel to execute. You can check the available memory using the nvidia-smi command (see the PyTorch-level sketch at the end of this post).

Check if you are using the correct CUDA version for your GPU and your code. Make sure that your code is compiled with the correct version of CUDA.

Update your GPU drivers to the latest version.

Try reducing the workload of the kernel to see if that resolves the issue.

Use CUDA error checking to locate the specific line of code causing the error. In native CUDA code you can do this by adding cudaDeviceSynchronize(); after the kernel call and checking for errors with cudaGetLastError(); in PyTorch, setting CUDA_LAUNCH_BLOCKING=1 or calling torch.cuda.synchronize() serves the same purpose.

If none of the above steps work, try running your code on a different GPU or system to see if the error persists.

By following these steps, you should be able to diagnose and fix the “CUDA error: unknown error” issue.
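
The memory and version checks above can also be done directly from PyTorch; here is a minimal sketch (torch.cuda.mem_get_info requires a reasonably recent PyTorch release):

import torch

# Free and total device memory in bytes, as reported by the CUDA driver
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1024**3:.2f} GiB, total: {total / 1024**3:.2f} GiB")

# Memory currently held by PyTorch's caching allocator
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

# CUDA toolkit PyTorch was built with, the GPU's compute capability, and the cuDNN version
print(torch.version.cuda)
print(torch.cuda.get_device_capability(0))
print(torch.backends.cudnn.version())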

Regards,
Rachel Gomez