I keep getting an error on my 3090. The crash happens at a random point, but it always happens eventually. I've reproduced the error with the minimal example below. I'm out of ideas: I've tried WSL, Windows, and Ubuntu, and I've reinstalled the drivers, PyTorch, the NVIDIA toolkit, etc. Any help would be greatly appreciated.
Currently using CUDA toolkit 11.7, PyTorch 1.13, NVIDIA driver version 525.
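In case it helps to compare setups, here is a minimal sketch of how these versions can be read from inside Python rather than from the system install. The helper name `collect_env_info` is only an illustrative choice, not part of PyTorch. One reason to check this way: PyTorch binaries normally bundle their own CUDA runtime and cuDNN, so the versions PyTorch actually uses may differ from the system toolkit.

```python
import platform


def collect_env_info():
    """Return a dict of version strings useful in a CUDA/cuDNN bug report."""
    info = {"python": platform.python_version(), "os": platform.platform()}
    try:
        import torch
    except ImportError:
        # Keep the report usable even on a machine without PyTorch.
        info["torch"] = "not installed"
        return info

    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None on CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    # cuDNN version as an integer, or None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

PyTorch also ships a fuller report that can be run with `python -m torch.utils.collect_env`, which is usually what gets requested in bug reports.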
import os
# Must be set before CUDA is initialized so kernel launches run synchronously
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
# Repeat the same forward/backward pass; the crash shows up at a random iteration
for i in range(0, 1000):
    data = torch.randn([512, 1024, 1, 1], dtype=torch.float, device='cuda', requires_grad=True)
    net = torch.nn.Conv2d(1024, 1024, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=32)
    net = net.cuda().float()
    out = net(data)
    out.backward(torch.randn_like(out))
    torch.cuda.synchronize()
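Since the failure point varies from run to run, it may be useful to record which iteration fails. The following is a rough sketch only: `find_first_failure` and `conv_step` are hypothetical helper names introduced here, and `conv_step` assumes the same tensor shapes and convolution settings as the script above.

```python
def find_first_failure(step, iterations=1000):
    """Call step(i) for each iteration; return (i, exc) for the first RuntimeError, else None."""
    for i in range(iterations):
        try:
            step(i)
        except RuntimeError as exc:
            # Stop at the first error instead of retrying on the same process.
            return i, exc
    return None


def conv_step(i):
    """One forward/backward pass of the grouped convolution from the repro."""
    import torch  # imported lazily so find_first_failure has no torch dependency

    data = torch.randn([512, 1024, 1, 1], dtype=torch.float, device="cuda", requires_grad=True)
    net = torch.nn.Conv2d(1024, 1024, kernel_size=3, padding=1, groups=32).cuda()
    out = net(data)
    out.backward(torch.randn_like(out))
    # Synchronize so an asynchronous CUDA error surfaces in this iteration.
    torch.cuda.synchronize()
```

Calling `find_first_failure(conv_step)` returns either `None` or the failing iteration together with the exception. The loop stops at the first error because a CUDA error typically leaves the process's CUDA context unusable, so each measurement needs a fresh Python process.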
#====================================================
Traceback (most recent call last):
  File "/home/jonathan/PycharmProjects/Adversarial_learning_paper_presentation/debug.py", line 17, in <module>
    out.backward(torch.randn_like(out))
  File "/home/jonathan/miniconda3/envs/Adversarial_learning_paper_presentation/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/jonathan/miniconda3/envs/Adversarial_learning_paper_presentation/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([512, 1024, 1, 1], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(1024, 1024, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=32)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
    memory_format = Contiguous
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 32
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7f3a140ca800
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 512, 1024, 1, 1,
    strideA = 1024, 1, 1, 1,
output: TensorDescriptor 0x7f3a041b0800
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 512, 1024, 1, 1,
    strideA = 1024, 1, 1, 1,
weight: FilterDescriptor 0x7f3a140b8d70
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 1024, 32, 3, 3,
Pointer addresses:
    input: 0x7f3a5ec00000
    output: 0x7f3a5ee00000
    weight: 0x7f3a5f200000
Additional pointer addresses:
    grad_output: 0x7f3a5ee00000
    grad_weight: 0x7f3a5f200000
Backward filter algorithm: 5
Process finished with exit code 1