CUDA error: unknown error. CUDA kernel errors might be asynchronously reported at some other API call

Hi all,

I am a beginner in PyTorch and CV. I ran into a problem when trying to use mmaction2 to extract features from video clips. Following the tutorial from here, I tried to run a single-video test, and my command is

python3 tools/misc/clip_feature_extraction.py \
configs/recognition/i3d/i3d_r50_video_32x2x1_100e_kinetics400_rgb.py \
pretrained/i3d_r50_video_32x2x1_100e_kinetics400_rgb_20200826-e31c6f52.pth \
--video-list examples/inputs/video_list_single.txt \
--video-root examples/inputs/video \
--out examples/outputs/examples_feature.pkl

However, I got a RuntimeError: CUDA error: unknown error.

load checkpoint from local path: pretrained/i3d_r50_video_32x2x1_100e_kinetics400_rgb_20200826-e31c6f52.pth
[                                                  ] 0/1, elapsed: 0s, ETA:Traceback (most recent call last):
  File "tools/misc/clip_feature_extraction.py", line 229, in <module>
    main()
  File "tools/misc/clip_feature_extraction.py", line 217, in main
    outputs = inference_pytorch(args, cfg, distributed, data_loader)
  File "tools/misc/clip_feature_extraction.py", line 118, in inference_pytorch
    outputs = single_gpu_test(model, data_loader)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/engine/test.py", line 33, in single_gpu_test
    result = model(return_loss=False, **data)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/recognizers/base.py", line 264, in forward
    return self.forward_test(imgs, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/recognizers/recognizer3d.py", line 99, in forward_test
    return self._do_test(imgs).cpu().numpy()
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/recognizers/recognizer3d.py", line 63, in _do_test
    feat = self.extract_feat(imgs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/recognizers/base.py", line 163, in extract_feat
    x = self.backbone(imgs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/backbones/resnet3d.py", line 854, in forward
    x = res_layer(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/backbones/resnet3d.py", line 318, in forward
    out = _inner_forward(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmaction/models/backbones/resnet3d.py", line 305, in _inner_forward
    out = self.conv1(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/cnn/bricks/conv_module.py", line 201, in forward
    x = self.conv(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/cnn/bricks/wrappers.py", line 80, in forward
    return super().forward(x)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 590, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/xxx/miniconda3/envs/mmlab/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 586, in _conv_forward
    input, weight, bias, self.stride, self.padding, self.dilation, self.groups
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

If I set CUDA_LAUNCH_BLOCKING=1, i.e., run CUDA_LAUNCH_BLOCKING=1 python3 ..., nothing more is shown.
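For reference, setting the variable from inside Python before anything touches CUDA should be equivalent; a minimal sketch (plain PyTorch, not specific to mmaction2):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch  # imported after the env var so the CUDA runtime sees it
x = torch.randn(4, device="cuda")
print(x.sum().item())  # kernels now run synchronously, so errors point at the real call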

I am not sure what causes the error, but I guess it might be a CUDA or PyTorch setup problem, since the code works properly on the other machine. FYI, I list the environments of the two machines below.

           Device 1 (has error)                           Device 2 (no error)
Platform   WSL2, Ubuntu 20.04.3                           WSL2, Ubuntu 20.04.3
GPU        GeForce GTX 1080 Ti, Driver=510.06, CUDA=11.6  GeForce RTX 2060, Driver=510.06, CUDA=11.6
PyTorch    pytorch=1.10.1, py=3.7, cuda=11.3.1            pytorch=1.10.1, py=3.7, cuda=11.3.1

My question is: what causes the error, and how can I fix it? Thanks very much.

Are you able to build and run any CUDA examples in the first setup at all, or are all CUDA applications crashing?
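For example, a minimal standalone check like the following (plain PyTorch, nothing mmaction2-specific) would already show whether basic CUDA kernels run at all on that machine:

import torch

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
x = torch.randn(64, 64, device="cuda")  # allocate on the GPU
y = x @ x                                # launch a matmul kernel
torch.cuda.synchronize()                 # block until it finishes so any error surfaces here
print(y.mean().item())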


I ran into this problem as well. I have run CUDA workloads normally many times before; this time it seems to have happened because I didn't terminate other processes in the PyCharm consoles. This is what I want to ask: how do I correctly terminate a process in the PyCharm console when running PyTorch?
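For reference, one way to see whether an earlier run is still holding the GPU is to query nvidia-smi for compute processes; a sketch (the query flags are standard nvidia-smi options, and on WSL2 the process list may be empty even when memory is in use):

import os, signal, subprocess

# list processes currently using the GPU (pid, name, memory)
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv"],
    capture_output=True, text=True)
print(out.stdout)

# a stale PID from that list could then be stopped with, e.g.:
# os.kill(pid, signal.SIGTERM)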

I got the same error, and it affects all CUDA applications.

This would point to a general issue in your setup, so try to reinstall the drivers and make sure CUDA applications can work before rerunning the PyTorch workload.
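After reinstalling, it may also be worth confirming that the installed PyTorch binaries actually ship kernels for the failing GPU's compute capability; a minimal sketch:

import torch

print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # e.g. (6, 1) for a GTX 1080 Ti
print(torch.cuda.get_arch_list())           # architectures the installed binaries were built for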

I am actually training a network and everything works just fine for the first ~25 epochs; then, at a random epoch after 25, this error comes up. I don't think there's anything special about the epoch at which the error appears, so I'm confused!

I don’t fully understand your issue then, as you’ve mentioned the same error is seen in all CUDA applications. Could you describe the issue in more detail (what exactly is working, what is failing and when) and check for any Xid messages in dmesg?
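For reference, the Xid check can be scripted as below (reading the kernel log usually needs root, so running dmesg directly in a shell may be simpler):

import subprocess

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
xid_lines = [line for line in log.splitlines() if "Xid" in line]
print("\n".join(xid_lines) if xid_lines else "no Xid messages found")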