I’m using the denoising diffusion library (denoising_diffusion_pytorch) from GitHub here:
I’m getting the CUDA errors below during sampling (where a random noise image is sent through the model), but no errors during training, which is odd.
I’ve tried turning off no_grad() during sampling.
I’ve tried with just 1 GPU.
I’ve tried toggling torch.backends.cudnn between True and False.
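Concretely, by the cuDNN toggle I mean roughly the following (a sketch, not my exact script; I’m assuming the enabled flag is the relevant one):

import torch

# Toggle cuDNN on/off before the model runs; I ran with both values
# and hit the same sampling-time error either way.
torch.backends.cudnn.enabled = True   # also tried False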
GPU Type: Tesla K80
NVIDIA Driver Version: 470.141.03
CUDA Version: 11.4
PyTorch: 1.12.1+cu113
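For context, my train.py follows the library’s README usage. A rough sketch is below (argument names as I remember them from the README; the values are placeholders, not my exact settings):

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)

diffusion = GaussianDiffusion(
    model,
    image_size = 128,           # placeholder value
    timesteps = 1000,
    sampling_timesteps = 250    # < timesteps, so the DDIM path (ddim_sample in the traceback below) is used
)

trainer = Trainer(
    diffusion,
    '/path/to/images',          # placeholder path
    train_batch_size = 32,
    train_lr = 8e-5,
    train_num_steps = 700000,
    gradient_accumulate_every = 2,
    ema_decay = 0.995,
    amp = True
)

trainer.train()                 # train.py line 34 in the traceback below; the crash happens
                                # inside the periodic EMA sampling step, not the training loop itself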
Traceback (most recent call last):
File "/home/jj/PycharmProjects/pythonProject/train.py", line 34, in <module>
trainer.train()
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 872, in train
all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 872, in <lambda>
all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 614, in sample
return sample_fn((batch_size, channels, image_size, image_size))
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 588, in ddim_sample
pred_noise, x_start, *_ = self.model_predictions(img, time_cond, self_cond, clip_x_start = clip_denoised)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 524, in model_predictions
model_output = self.model(x, t, x_self_cond)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 355, in forward
x = self.init_conv(x)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
[W CUDAGuardImpl.h:113] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7efe9fa7f20e in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x23a21 (0x7efe9faf6a21 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7efe9fafb9a7 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4637b8 (0x7efe8bf3a7b8 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7efe9fa667a5 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x1735345 (0x7efe641e6345 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::Reducer::~Reducer() + 0x1ef (0x7efe6721e98f in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7efe8c4aa4b2 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7efe8be361e8 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x9d6ae1 (0x7efe8c4adae1 in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3665ff (0x7efe8be3d5ff in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x3674ef (0x7efe8be3e4ef in /home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x138953 (0x55ff9f1db953 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #13: <unknown function> + 0x1e6b11 (0x55ff9f289b11 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #14: <unknown function> + 0x12c2f5 (0x55ff9f1cf2f5 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #15: <unknown function> + 0x269710 (0x55ff9f30c710 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #16: <unknown function> + 0x268ab6 (0x55ff9f30bab6 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #17: Py_FinalizeEx + 0x176 (0x55ff9f308206 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #18: Py_RunMain + 0x173 (0x55ff9f2f8033 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #19: Py_BytesMain + 0x2d (0x55ff9f2ce0dd in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
frame #20: <unknown function> + 0x29d90 (0x7efea86b7d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7efea86b7e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: _start + 0x25 (0x55ff9f2cdfd5 in /home/jj/PycharmProjects/pythonProject/venv/bin/python)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126902 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126903 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126904 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126905 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126906 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 126901) of binary: /home/jj/PycharmProjects/pythonProject/venv/bin/python
Traceback (most recent call last):
File "/home/jj/PycharmProjects/pythonProject/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-10-01_12:20:57
host : jj-X9DRG-HF
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 126901)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 126901
=======================================================
Traceback (most recent call last):
File "/home/jj/PycharmProjects/pythonProject/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 831, in launch_command
multi_gpu_launcher(args)
File "/home/jj/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 450, in multi_gpu_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '6', 'train.py']' returned non-zero exit status 1.