YOLOv7 RuntimeError: CUDA error: unknown error

I'm getting the error below when trying to train a custom YOLOv7 model. I set up my environment by following this guide: CUDA on WSL :: CUDA Toolkit Documentation.

OS: Ubuntu (Windows 10 WSL)
Hardware: 16 GB RAM, RTX 3070
Python Version: 3.8.10
Driver Version: 517.48
PyTorch Version:

>>> import torch
>>> print(torch.__version__)
>>> import torchvision
>>> print(torchvision.__version__)

Cuda Version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0

Some common fixes that have solved this issue for other people, and that I’ve already attempted:

Docker Container:
sudo docker run --name yolov7 --gpus all -it -v "/mnt/c/coco/":"/coco/" -v "/mnt/c/yolov7/":"/yolov7/" --shm-size=16gb nvcr.io/nvidia/pytorch:21.08-py3

The part that I don’t understand is that torch appears to be operating correctly:

>>> import torch
>>> print(torch.cuda.current_device())
>>> torch.rand(1)


| NVIDIA-SMI 515.76.02    Driver Version: 517.48       CUDA Version: 11.7     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:2B:00.0  On |                  N/A |
|  0%   46C    P8    27W / 240W |    423MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

From Within Docker Container:
python train.py --workers 1 --device 0 --batch-size 2 --data data/coco.yaml --img-size 1920 --cfg cfg/training/yolov7.yaml --weights 'yolov7_training.pt' --name yolov7 --hyp data/hyp.scratch.custom.yaml

The above call produces the following error:

Traceback (most recent call last):
  File "train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 361, in train
    pred = model(imgs)  # forward
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/yolov7/models/yolo.py", line 599, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/yolov7/models/yolo.py", line 625, in forward_once
    x = m(x)  # run
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/yolov7/models/common.py", line 108, in forward
    return self.act(self.bn(self.conv(x)))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 395, in forward
    return F.silu(input, inplace=self.inplace)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1901, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
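
The hint at the end of the error message can be followed with a small launcher like the one below. This is only a sketch: the helper names blocking_env and run_blocking are made up for illustration, and the sole thing it relies on is the CUDA_LAUNCH_BLOCKING variable named in the message. With the variable set, kernel launches run synchronously, so the stack trace should point at the call that actually failed.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Copy an environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(args):
    """Run the current Python interpreter with CUDA_LAUNCH_BLOCKING=1.

    `args` is the script name followed by its arguments, for example
    ["train.py", "--workers", "1", "--device", "0", "--batch-size", "2"].
    Returns the exit code of the child process.
    """
    completed = subprocess.run([sys.executable, *args], env=blocking_env(), check=False)
    return completed.returncode
```

Setting the variable in the shell before calling python train.py has the same effect; the wrapper just keeps the change local to one run.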

An “unknown error” often points to an invalid setup. Do you see any Xid entries in dmesg, and was this WSL setup ever working before?
Also, are you able to run any CUDA sample inside the container from your WSL terminal?
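
One way to check for Xid entries is to filter the kernel log for lines tagged "NVRM: Xid". The sketch below does that in Python; the function names find_xid_lines and read_dmesg are hypothetical, and it assumes dmesg is on the PATH and readable by the current user. Under WSL the GPU driver lives on the Windows side, so it is possible that nothing shows up here even when the GPU has faulted.

```python
import subprocess


def find_xid_lines(log_text):
    """Return every line of log_text that contains an NVIDIA Xid report."""
    return [line for line in log_text.splitlines() if "NVRM: Xid" in line]


def read_dmesg():
    """Return the kernel log as text, or an empty string if dmesg cannot be run."""
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    except OSError:
        # dmesg is not installed or cannot be executed on this system.
        return ""
    return result.stdout


if __name__ == "__main__":
    hits = find_xid_lines(read_dmesg())
    if hits:
        print("\n".join(hits))
    else:
        print("No Xid entries found in dmesg output.")
```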

I thought I had included this in my explanation, but the error only occurs when I set the batch size to a value greater than 1. With a batch size of 1, training completes without error.

I might have missed that explanation; so far you’ve only mentioned that creating a CPUTensor with a single value works correctly, which seems unrelated to this problem.
Which repository are you using, and could you post a minimal, executable code snippet that reproduces the error on your device?
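
For reference, a reproduction along these lines could look like the sketch below. It is not taken from the YOLOv7 repository: it only mimics the Conv2d -> BatchNorm2d -> SiLU sequence visible in the traceback. The function name run_repro, the channel count of 32, and the status strings are assumptions made for illustration. Note that torch.rand(1) without a device argument creates a tensor on the CPU, which is why the input here is created on the requested device explicitly.

```python
import importlib.util


def run_repro(batch_size=2, img_size=1920, device="cuda"):
    """Push one random batch through Conv2d -> BatchNorm2d -> SiLU.

    Returns the output shape as a tuple, or a status string if PyTorch
    or CUDA is not available in this environment.
    """
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"

    import torch
    import torch.nn as nn

    if device.startswith("cuda") and not torch.cuda.is_available():
        return "cuda-unavailable"

    # Same layer order as the Conv block in the traceback; sizes are illustrative.
    block = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(32),
        nn.SiLU(),
    ).to(device)

    # Create the input directly on the target device instead of on the CPU.
    imgs = torch.rand(batch_size, 3, img_size, img_size, device=device)
    out = block(imgs)

    if device.startswith("cuda"):
        # Wait for the GPU so that an asynchronous error is raised here.
        torch.cuda.synchronize()
    return tuple(out.shape)


if __name__ == "__main__":
    for bs in (1, 2):
        print("batch size", bs, "->", run_repro(batch_size=bs))
```

Running it with batch sizes 1 and 2 inside the container would show whether the failure depends on the batch size outside of train.py.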