NVidia GPU Tesla K80 Illegal Memory Access

I’m using the denoising diffusion library on GitHub here:

I’m getting the CUDA errors below during sampling (where a random noise image is passed through the model), but no errors during training, which is odd.

I’ve tried turning off no_grad() during sampling.

I’ve tried with just 1 GPU.

I’ve tried with cuDNN both enabled and disabled (backends.cudnn = True/False).

GPU Type: Tesla K80
NVIDIA Driver Version: 470.141.03
CUDA Version: 11.4
PyTorch: 1.12.1+cu113

Traceback (most recent call last):
  File "/home/jj/PycharmProjects/pythonProject/train.py", line 34, in <module>
    trainer.train()
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 872, in train
    all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 872, in <lambda>
    all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 614, in sample
    return sample_fn((batch_size, channels, image_size, image_size))
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 588, in ddim_sample
    pred_noise, x_start, *_ = self.model_predictions(img, time_cond, self_cond, clip_x_start = clip_denoised)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 524, in model_predictions
    model_output = self.model(x, t, x_self_cond)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 355, in forward
    x = self.init_conv(x)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
[W CUDAGuardImpl.h:113] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7efe9fa7f20e in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x23a21 (0x7efe9faf6a21 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7efe9fafb9a7 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4637b8 (0x7efe8bf3a7b8 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7efe9fa667a5 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x1735345 (0x7efe641e6345 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::Reducer::~Reducer() + 0x1ef (0x7efe6721e98f in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7efe8c4aa4b2 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7efe8be361e8 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x9d6ae1 (0x7efe8c4adae1 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3665ff (0x7efe8be3d5ff in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x3674ef (0x7efe8be3e4ef in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x138953 (0x55ff9f1db953 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #13: <unknown function> + 0x1e6b11 (0x55ff9f289b11 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #14: <unknown function> + 0x12c2f5 (0x55ff9f1cf2f5 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #15: <unknown function> + 0x269710 (0x55ff9f30c710 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #16: <unknown function> + 0x268ab6 (0x55ff9f30bab6 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #17: Py_FinalizeEx + 0x176 (0x55ff9f308206 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #18: Py_RunMain + 0x173 (0x55ff9f2f8033 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #19: Py_BytesMain + 0x2d (0x55ff9f2ce0dd in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #20: <unknown function> + 0x29d90 (0x7efea86b7d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7efea86b7e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: _start + 0x25 (0x55ff9f2cdfd5 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126902 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126903 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126904 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126905 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126906 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 126901) of binary: /home/jj/PycharmProjects/pythonProject/venv/bin/python
Traceback (most recent call last):
  File "/home/jj/PycharmProjects/pythonProject/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-10-01_12:20:57
  host      : jj-X9DRG-HF
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 126901)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 126901
=======================================================
Traceback (most recent call last):
  File "/home/jj/PycharmProjects/pythonProject/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 831, in launch_command
    multi_gpu_launcher(args)
  File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 450, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '6', 'train.py']' returned non-zero exit status 1.

You are running into an illegal memory access. Could you rerun your code with cuda-gdb or compute-sanitizer and post the stack trace here, please?

Thank you for your response. I’ve installed the CUDA Toolkit now. Normally I would run the script with:

accelerate launch train.py

How would I run it in either of those two?

Prepend compute-sanitizer to your command.
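As a rough sketch of what "prepending" looks like for a launcher-based run: compute-sanitizer options go before the target command, since anything placed after the script name is passed to the script instead. The helper name `sanitizer_command` is hypothetical. The `--target-processes all` option is included on the assumption that the launcher spawns worker processes that the tool would otherwise not follow; check `compute-sanitizer --help` for your toolkit version before relying on it.

```python
import shlex


def sanitizer_command(target_cmd, launch_timeout=None, follow_children=True):
    """Build an argv list that runs target_cmd under compute-sanitizer.

    All sanitizer options are placed before the target command.
    """
    cmd = ["compute-sanitizer"]
    if follow_children:
        # Assumption: needed when a launcher (accelerate, torchrun) spawns
        # the processes that actually use the GPU.
        cmd += ["--target-processes", "all"]
    if launch_timeout is not None:
        cmd += ["--launch-timeout", str(launch_timeout)]
    return cmd + list(target_cmd)


if __name__ == "__main__":
    argv = sanitizer_command(["accelerate", "launch", "train.py"], launch_timeout=1000)
    # Prints: compute-sanitizer --target-processes all --launch-timeout 1000 accelerate launch train.py
    print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv)` or simply copied into a shell.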

I’ve tried installing the cuda toolkit several times now. But it seems to have a conflict with the default drivers installed by Ubuntu. Whatever it installed by default does not come with either of those debuggers. Is there a way to install just compute-sanitizer without reinstalling CUDA?

compute-sanitizer ships with the CUDA toolkit and can be installed from here.

I managed to get compute-sanitizer installed and working. Here is the full output from the run. I included the flag --launch-timeout 1000, although the whole process, crash included, took less than a minute.

/pythonProject$ /usr/local/cuda-11.4/bin/compute-sanitizer accelerate launch train.py --launch-timeout 1000
========= COMPUTE-SANITIZER
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `16` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
36251779
36251779
36251779
36251779
36251779
36251779
========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.
sampling loop time step:   0%|                                                       | 0/250 [00:00<?, ?it/s]
loss: 0.8630:   0%|                                                               | 0/700000 [00:29<?, ?it/s]
Traceback (most recent call last):
  File "/media/jj/Store1/PycharmProjects/pythonProject/train.py", line 34, in <module>
    trainer.train()
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 869, in train
    all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 869, in <lambda>
    all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 613, in sample
    return sample_fn((batch_size, channels, image_size, image_size))
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 588, in ddim_sample
    pred_noise, x_start, *_ = self.model_predictions(img, time_cond, self_cond, clip_x_start = clip_denoised)
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 524, in model_predictions
    model_output = self.model(x, t, x_self_cond)
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 355, in forward
    x = self.init_conv(x)
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: an illegal memory access was encountered
[W CUDAGuardImpl.h:113] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fb46a0c420e in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x23a21 (0x7fb462935a21 in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7fb46293a9a7 in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4637b8 (0x7fb44ed3a7b8 in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fb46a0ab7a5 in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x1735345 (0x7fb426fe6345 in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::Reducer::~Reducer() + 0x1ef (0x7fb42a01e98f in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fb44f2aa4b2 in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fb44ec361e8 in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x9d6ae1 (0x7fb44f2adae1 in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3665ff (0x7fb44ec3d5ff in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x3674ef (0x7fb44ec3e4ef in /media/jj/Store1/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x138953 (0x558f4cae4953 in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)
frame #13: <unknown function> + 0x1e6b11 (0x558f4cb92b11 in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)
frame #14: <unknown function> + 0x12c2f5 (0x558f4cad82f5 in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)
frame #15: <unknown function> + 0x269710 (0x558f4cc15710 in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)
frame #16: <unknown function> + 0x268ab6 (0x558f4cc14ab6 in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)
frame #17: Py_FinalizeEx + 0x176 (0x558f4cc11206 in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)
frame #18: Py_RunMain + 0x173 (0x558f4cc01033 in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)
frame #19: Py_BytesMain + 0x2d (0x558f4cbd70dd in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)
frame #20: <unknown function> + 0x29d90 (0x7fb46b5cad90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7fb46b5cae40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: _start + 0x25 (0x558f4cbd6fd5 in /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13620 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13621 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13622 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13623 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13624 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 13619) of binary: /media/jj/Store1/PycharmProjects/pythonProject/venv/bin/python

compute-sanitizer didn’t output anything:

========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.

so it’s still unclear which kernel is causing the issue.
Could you try to post a minimal, executable code snippet which would reproduce the error as well as the output of python -m torch.utils.collect_env?
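A minimal reproduction might look something like the sketch below, which isolates the step where the traceback ends: a single convolution applied to random noise under no_grad(). The layer shape (3 input channels, 64 output channels, 7x7 kernel, padding 3), the batch size, and the image size are assumptions for illustration, not values taken from the failing run, and `run_init_conv` is a hypothetical helper name.

```python
import torch
import torch.nn as nn


def run_init_conv(batch_size=4, channels=3, dim=64, image_size=128):
    """Run one convolution on random noise, on the GPU if one is available."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Assumed stand-in for the model's first layer (self.init_conv in the traceback).
    conv = nn.Conv2d(channels, dim, kernel_size=7, padding=3).to(device)
    noise = torch.randn(batch_size, channels, image_size, image_size, device=device)
    with torch.no_grad():
        out = conv(noise)
    if device.type == "cuda":
        # CUDA kernels run asynchronously; synchronizing makes a failure surface here.
        torch.cuda.synchronize()
    return out


if __name__ == "__main__":
    out = run_init_conv()
    print(out.shape, out.device)
```

If this small script fails on the K80 but runs on other GPUs, that points at the driver/runtime combination rather than at the training code.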

It seems like it may be a driver/GPU issue, because the same scripts work on other GPUs I’ve tested. And since NVIDIA no longer provides updates for the Tesla K80, I probably just need to find a combination of CUDA, NVIDIA driver, and PyTorch versions that works. So far I’ve tried:

CUDA 11.4 / 470 drivers / PyTorch Stable 1.12.1 cu113.

Is there a good reference where I can find which CUDA / driver / PyTorch versions are compatible? Should I reinstall with CUDA 11.3?

Could you try to keep the driver installed and use the PyTorch binary with the CUDA 10.2 runtime? Since you are using an older GPU I’m wondering if a CUDA Math lib might be running into an issue.
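To check which CUDA runtime an installed PyTorch binary was built against, and whether that build lists the GPU's architecture, something along these lines could help. It is only a sketch: `describe_torch_build` is a hypothetical helper, and the output complements rather than replaces python -m torch.utils.collect_env. For reference, the Tesla K80 has compute capability 3.7, so its entry would read sm_37.

```python
def describe_torch_build():
    """Collect the PyTorch version, its CUDA/cuDNN runtime, and GPU details."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,       # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        major, minor = torch.cuda.get_device_capability(0)
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = f"sm_{major}{minor}"
        # Architectures the installed binary was compiled for.
        info["arch_list"] = torch.cuda.get_arch_list()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

With a CUDA 10.2 build installed, the cuda_runtime entry should report 10.2 regardless of the CUDA version that nvidia-smi shows for the driver.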

Just to update, I installed:

  1. Ubuntu 18.04 Server,
  2. CUDA 10.2 (came with 460 drivers), and
  3. the cu10.2 version of PyTorch.

No errors during sampling! Thank you for your help.

On a side note, the Ubuntu driver manager did not offer this driver/CUDA option, so I downloaded it from NVIDIA, finding the appropriate version from here:

Hope this might help save someone some time in the future.
