I’m using the denoising diffusion library (denoising_diffusion_pytorch) from GitHub here:
I’m getting the CUDA errors below during sampling (where a random noise image is sent through the model), but no errors during training, which is odd.
I’ve tried turning off no_grad() during sampling.
I’ve tried with just 1 GPU.
I’ve tried toggling torch.backends.cudnn between True and False.
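Concretely, by the cuDNN toggle I mean roughly the following (a sketch, not my exact script; I’m assuming the enabled flag is the relevant one):

import torch

# Toggle cuDNN on/off before the model runs; I ran with both values
# and hit the same sampling-time error either way.
torch.backends.cudnn.enabled = True   # also tried False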
GPU Type: Tesla K80
NVIDIA Driver Version: 470.141.03
CUDA Version: 11.4
PyTorch: 1.12.1+cu113
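For context, my train.py follows the library’s README usage. A rough sketch is below (argument names as I remember them from the README; the values are placeholders, not my exact settings):

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)

diffusion = GaussianDiffusion(
    model,
    image_size = 128,           # placeholder value
    timesteps = 1000,
    sampling_timesteps = 250    # < timesteps, so the DDIM path (ddim_sample in the traceback below) is used
)

trainer = Trainer(
    diffusion,
    '/path/to/images',          # placeholder path
    train_batch_size = 32,
    train_lr = 8e-5,
    train_num_steps = 700000,
    gradient_accumulate_every = 2,
    ema_decay = 0.995,
    amp = True
)

trainer.train()                 # train.py line 34 in the traceback below; the crash happens
                                # inside the periodic EMA sampling step, not the training loop itself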
Traceback (most recent call last):
File "/home/jj/PycharmProjects/pythonProject/train.py", line 34, in <module>
trainer.train()
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 872, in train
all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 872, in <lambda>
all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 614, in sample
return sample_fn((batch_size, channels, image_size, image_size))
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 588, in ddim_sample
pred_noise, x_start, *_ = self.model_predictions(img, time_cond, self_cond, clip_x_start = clip_denoised)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 524, in model_predictions
model_output = self.model(x, t, x_self_cond)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 355, in forward
x = self.init_conv(x)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
[W CUDAGuardImpl.h:113] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7efe9fa7f20e in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x23a21 (0x7efe9faf6a21 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7efe9fafb9a7 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4637b8 (0x7efe8bf3a7b8 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7efe9fa667a5 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x1735345 (0x7efe641e6345 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::Reducer::~Reducer() + 0x1ef (0x7efe6721e98f in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7efe8c4aa4b2 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7efe8be361e8 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x9d6ae1 (0x7efe8c4adae1 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3665ff (0x7efe8be3d5ff in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x3674ef (0x7efe8be3e4ef in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x138953 (0x55ff9f1db953 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #13: <unknown function> + 0x1e6b11 (0x55ff9f289b11 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #14: <unknown function> + 0x12c2f5 (0x55ff9f1cf2f5 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #15: <unknown function> + 0x269710 (0x55ff9f30c710 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #16: <unknown function> + 0x268ab6 (0x55ff9f30bab6 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #17: Py_FinalizeEx + 0x176 (0x55ff9f308206 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #18: Py_RunMain + 0x173 (0x55ff9f2f8033 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #19: Py_BytesMain + 0x2d (0x55ff9f2ce0dd in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #20: <unknown function> + 0x29d90 (0x7efea86b7d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7efea86b7e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: _start + 0x25 (0x55ff9f2cdfd5 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126902 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126903 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126904 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126905 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126906 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 126901) of binary: /home/jj/PycharmProjects/pythonProject/venv/bin/python
Traceback (most recent call last):
File "/home/jj/PycharmProjects/pythonProject/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-10-01_12:20:57
host : jj-X9DRG-HF
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 126901)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 126901
=======================================================
Traceback (most recent call last):
File "/home/jj/PycharmProjects/pythonProject/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 831, in launch_command
multi_gpu_launcher(args)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 450, in multi_gpu_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '6', 'train.py']' returned non-zero exit status 1.