Hi everyone,
I’ve been reading a lot of posts here recently, but none of them solved my issue, so I decided to write up my own problem.
I am working with a ResNet50 model, and I have some scripts that used to run fine on my CPU. I bought an NVIDIA A40 to speed them up, but now I can’t get training to run on the GPU at all.
Training runs for at most two epochs and then fails with:
cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
right at the ‘loss.backward()’ call. This is the error I get when running with CUDA_LAUNCH_BLOCKING=1; before setting that flag, the error was:
cuDNN error: CUDNN_STATUS_MAPPING_ERROR
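For reference, this is roughly how I set the flag, in the very first cell of the notebook before importing torch (as far as I understand, it is silently ignored if CUDA has already been initialized):

    import os

    # Must be set before torch initializes CUDA, otherwise it has no effect.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch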
Furthermore, after the error, running ‘nvidia-smi’ in a terminal gives me:
‘Unable to determine the device handle for GPU0000:86:00.0: Unknown Error’
Sorry for not providing a minimal script to reproduce the errors; I haven’t figured out how to do it, since 95% of the script is about reading my personal dataset. I’ve tried lowering my batch_size, but I keep getting the same error.
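The training loop itself is the standard pattern; stripped of the dataset code, it is essentially the sketch below, with random tensors standing in for my data (so I can’t promise this exact snippet reproduces the crash):

    import torch
    import torch.nn as nn
    import torchvision

    device = torch.device("cuda")
    model = torchvision.models.resnet50(weights=None).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    for epoch in range(10):
        for _ in range(100):
            # Random tensors stand in for my real dataset here.
            inputs = torch.randn(8, 3, 224, 224, device=device)
            targets = torch.randint(0, 1000, (8,), device=device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()  # <- the cuDNN error is raised here
            optimizer.step()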
I can, however, provide the output of python3 -m torch.utils.collect_env:
Collecting environment information…
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.10.0-21-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A40
Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchvision==0.14.1
[conda] Could not collect
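Note that collect_env reports “cuDNN version: Could not collect”; since the pip wheel bundles its own cuDNN rather than using a system libcudnn, the bundled version can be checked from inside Python like this:

    import torch

    print(torch.version.cuda)                   # CUDA version the wheel was built with
    print(torch.backends.cudnn.is_available())  # True if cuDNN can be used
    print(torch.backends.cudnn.version())       # bundled cuDNN version, e.g. 8500 = 8.5.x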
And here is my ‘nvidia-smi’ output (before running the code and hitting the error):
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:86:00.0 Off |                    0 |
|  0%   39C    P8              23W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1342      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
Any help would be really appreciated. Thanks in advance.
PS: I don’t work with Anaconda at all, and I’m running my script from a Jupyter notebook.