I am getting the error below when trying to train a custom YOLOv7 model. I set up my environment by following this guide: CUDA on WSL :: CUDA Toolkit Documentation.
OS: Ubuntu (Windows 10 WSL)
Hardware: 16 GB RAM, RTX 3070
Python Version: 3.8.10
Driver Version: 517.48
PyTorch Version:
>>> import torch
>>> print(torch.__version__)
1.10.0a0+3fd9dcf
>>> import torchvision
>>> print(torchvision.__version__)
0.11.0a0
CUDA Version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
Common fixes that have worked for other people, which I have already tried without success:
- Restarting the computer
- apt-get install nvidia-modprobe (RuntimeError: CUDA unknown error · Issue #49081 · pytorch/pytorch · GitHub)
Docker container command:
sudo docker run --name yolov7 --gpus all -it -v "/mnt/c/coco/":"/coco/" -v "/mnt/c/yolov7/":"/yolov7/" --shm-size=16gb nvcr.io/nvidia/pytorch:21.08-py3
The part that I don’t understand is that torch appears to be operating correctly:
>>> import torch
>>> print(torch.cuda.current_device())
0
>>> torch.rand(1)
tensor([0.3052])
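One caveat about the check above: torch.rand(1) creates its tensor on the CPU, and torch.cuda.current_device() only reports a device index, so neither call proves that a kernel can actually run on the GPU. Below is a minimal sketch of a stricter check. The helper name gpu_smoke_test and its status strings are hypothetical, and SiLU is used only because it is the op named in the traceback further down.

```python
try:
    import torch
except ImportError:  # report the problem instead of crashing
    torch = None


def gpu_smoke_test(n: int = 1024) -> str:
    """Run one small SiLU kernel on the GPU and compare it with the CPU result."""
    if torch is None:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available"
    x = torch.rand(n)  # CPU tensor, like torch.rand(1) above
    on_gpu = torch.nn.functional.silu(x.to("cuda"))
    torch.cuda.synchronize()  # surface asynchronous kernel errors here
    expected = torch.nn.functional.silu(x)
    return "ok" if torch.allclose(on_gpu.cpu(), expected, atol=1e-5) else "mismatch"


print(gpu_smoke_test())
```

If the GPU path itself is broken, this should raise the same RuntimeError as the training run instead of printing a status, which would at least show that the problem is not specific to YOLOv7.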
Output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.76.02    Driver Version: 517.48       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:2B:00.0  On |                  N/A |
|  0%   46C    P8    27W / 240W |    423MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Training command, run from within the Docker container:
python train.py --workers 1 --device 0 --batch-size 2 --data data/coco.yaml --img-size 1920 --cfg cfg/training/yolov7.yaml --weights 'yolov7_training.pt' --name yolov7 --hyp data/hyp.scratch.custom.yaml
The above call produces the following error:
Traceback (most recent call last):
  File "train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 361, in train
    pred = model(imgs)  # forward
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/yolov7/models/yolo.py", line 599, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/yolov7/models/yolo.py", line 625, in forward_once
    x = m(x)  # run
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/yolov7/models/common.py", line 108, in forward
    return self.act(self.bn(self.conv(x)))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 395, in forward
    return F.silu(input, inplace=self.inplace)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1901, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
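Since the message warns that the stack trace may point at the wrong call, the next step is to repeat the run with CUDA_LAUNCH_BLOCKING=1 so that kernel launches are synchronous. The variable has to be in the environment before CUDA is initialized, so it is safest to set it for the whole process. A minimal sketch of doing that from Python follows. The helper name run_with_launch_blocking is hypothetical, and the child command here is only a stand-in that checks the variable was passed through.

```python
import os
import subprocess
import sys


def run_with_launch_blocking(cmd):
    """Run cmd in a child process with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # make kernel launches synchronous
    return subprocess.run(cmd, env=env).returncode


# Stand-in child process that only verifies the variable reached it;
# swap in the train.py command from above to get the real traceback.
check = [
    sys.executable,
    "-c",
    "import os, sys; sys.exit(0 if os.environ.get('CUDA_LAUNCH_BLOCKING') == '1' else 1)",
]
print(run_with_launch_blocking(check))
```

The shell equivalent is to prefix the training command with CUDA_LAUNCH_BLOCKING=1. Training will be slower in this mode, but the traceback should then point at the call that actually failed.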