PyTorch not working on a 3090

I keep getting an error on my 3090. It crashes at a random point, but it always crashes eventually. I've reproduced the error with a minimal example. I'm all out of ideas: I've tried WSL, Windows, and Ubuntu, and I've tried reinstalling the drivers, PyTorch, the NVIDIA toolkit, etc. Any help would be greatly appreciated.

Currently using CUDA toolkit 11.7, PyTorch 1.13, NVIDIA driver version 525.

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True

for i in range(0, 1000):
    data = torch.randn([512, 1024, 1, 1], dtype=torch.float, device='cuda', requires_grad=True)
    net = torch.nn.Conv2d(1024, 1024, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=32)
    net = net.cuda().float()
    out = net(data)
    out.backward(torch.randn_like(out))
    torch.cuda.synchronize()

#====================================================
Traceback (most recent call last):
  File "/home/jonathan/PycharmProjects/Adversarial_learning_paper_presentation/debug.py", line 17, in
    out.backward(torch.randn_like(out))
  File "/home/jonathan/miniconda3/envs/Adversarial_learning_paper_presentation/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/jonathan/miniconda3/envs/Adversarial_learning_paper_presentation/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([512, 1024, 1, 1], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(1024, 1024, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=32)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    memory_format = Contiguous
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 32
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7f3a140ca800
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 512, 1024, 1, 1,
    strideA = 1024, 1, 1, 1,
output: TensorDescriptor 0x7f3a041b0800
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 512, 1024, 1, 1,
    strideA = 1024, 1, 1, 1,
weight: FilterDescriptor 0x7f3a140b8d70
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 1024, 32, 3, 3,
Pointer addresses:
    input: 0x7f3a5ec00000
    output: 0x7f3a5ee00000
    weight: 0x7f3a5f200000
Additional pointer addresses:
    grad_output: 0x7f3a5ee00000
    grad_weight: 0x7f3a5f200000
Backward filter algorithm: 5

Process finished with exit code 1

I cannot reproduce the issue using torch==1.13.1+cu117 on a 3090 on Linux.
If I understand your setup correctly, you are using WSL(2) with the same wheels?
Was this setup working before, or did you change something that caused these random crashes?

It never worked very well, but it now seems to crash sooner. I've tried WSL(2), Ubuntu 22.10, and Windows. I'm starting to think it might be a defective GPU. This exact code snippet was run on Ubuntu 22.10.
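To put a number on how soon it crashes, one option is to run the repro script several times, each in a fresh process, and record the exit code and the last line printed. A fresh process per trial is used because a CUDA error like this one often leaves the process's CUDA context unusable. This is only a sketch: `run_trials` is a hypothetical helper, and it assumes the repro script prints the loop index on every iteration, which the script above does not do as posted.

```python
import subprocess
import sys


def run_trials(script_args, trials=5, timeout=600, python=sys.executable):
    """Run `python *script_args` in a fresh process `trials` times.

    Returns one (exit_code, last_stdout_line) tuple per trial.
    Raises subprocess.TimeoutExpired if a trial exceeds `timeout` seconds.
    """
    results = []
    for _ in range(trials):
        proc = subprocess.run(
            [python, *script_args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        lines = proc.stdout.strip().splitlines()
        # The last printed line is assumed to be the most recent loop index.
        last_line = lines[-1] if lines else ""
        results.append((proc.returncode, last_line))
    return results
```

For example, `run_trials(["debug.py"], trials=10)` would give ten (exit code, last iteration) pairs that can be compared between WSL, Windows, and Ubuntu, assuming `debug.py` has a `print(i)` added inside the loop.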

That could be the case. You could try running some tests unrelated to PyTorch, e.g. the CUDA samples, to check whether these also crash randomly.
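As a complement to the CUDA samples, a repeatability check can sometimes surface flaky hardware: run the same fixed computation many times and compare the raw bytes of each result against the first run. The sketch below is hedged accordingly. `count_mismatches` and `make_matmul_step` are hypothetical helpers, and it assumes that a fixed matmul on the same device gives bitwise-identical results on healthy hardware, which is not guaranteed for every CUDA kernel or setting.

```python
import hashlib


def count_mismatches(compute, runs=100):
    """Call compute() `runs` times; return the indices of runs whose
    bytes differ from the first run (an empty list means all matched)."""
    reference = hashlib.sha256(compute()).hexdigest()
    return [
        i for i in range(1, runs)
        if hashlib.sha256(compute()).hexdigest() != reference
    ]


def make_matmul_step(size=2048, seed=0):
    """Hypothetical workload: a fixed matmul returning its result as bytes."""
    import torch  # imported lazily so count_mismatches works without PyTorch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    gen = torch.Generator().manual_seed(seed)
    # Inputs are created once on the CPU and reused for every call.
    a = torch.randn(size, size, generator=gen).to(device)
    b = torch.randn(size, size, generator=gen).to(device)

    def step():
        return (a @ b).cpu().numpy().tobytes()

    return step
```

Calling `count_mismatches(make_matmul_step(), runs=1000)` returns the indices of any runs that disagreed with the first one. A non-empty list would be a hint worth following up, not proof of a defect.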

Thanks for your answer. I've tried the CUDA samples (well, a lot of them anyway) but couldn't reproduce the issue.

Hi, did you solve the problem? I also ran into it using CUDA toolkit 11.7, PyTorch 1.13, and NVIDIA driver version 525 on Ubuntu 20.04.

python -m torch.utils.collect_env

Collecting environment information…
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090

Nvidia driver version: 525.89.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-lightning==1.7.7
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchmetrics==0.11.4
[pip3] torchvision==0.14.1
[conda] blas 1.0 mkl defaults
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640 defaults
[conda] mkl-service 2.4.0 py39h7e14d7c_0 conda-forge
[conda] mkl_fft 1.3.1 py39h0c7bc48_1 conda-forge
[conda] mkl_random 1.2.2 py39hde0f152_0 conda-forge
[conda] numpy 1.23.5 py39h14f4228_0 defaults
[conda] numpy-base 1.23.5 py39h31eccc5_0 defaults
[conda] pytorch 1.13.1 py3.9_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h778d358_3 pytorch
[conda] pytorch-lightning 1.7.7 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.13.1 py39_cu117 pytorch
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchvision 0.14.1 py39_cu117 pytorch

In the end I sent my GPU to the manufacturer, and they confirmed that it failed some of their tests. The GPU was simply defective.