CUDA memory error when trying to train model

agt · May 10, 2020, 7:21am

I’m trying to train a PyTorch model, pix2pix. It has an option to speed up training with “Automatic Mixed Precision” (AMP). But when I enable it and launch with python -m torch.distributed.launch train.py, I get a CUDA error: “an illegal memory access was encountered”. The device is “pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0”.

It gives me 100 messages like:

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.103515625e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05

And finally:

Traceback (most recent call last):
  File "train.py", line 85, in <module>
    with amp.scale_loss(loss_G, optimizer_G) as scaled_loss: scaled_loss.backward()                
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/handle.py", line 127, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--fp16', '--num_D', '1', '--name', 'n', '--dataroot', './datasets/n/', '--label_nc', '0', '--no_instance', '--resize_or_crop', 'scale_width_and_crop', '--save_epoch_freq', '2', '--checkpoints_dir', '/content/drive/My Drive/checkpoints', '--tf_log']' returned non-zero exit status 1.

How can I fix this?

ptrblck · May 10, 2020, 8:49am

Could you check whether your loss is producing NaN values?
If so, do you also see the NaNs when apex/amp is disabled?

Also, we recommend trying the native PyTorch amp support, which is available in the nightly binaries as torch.cuda.amp.
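
For reference, here is a minimal sketch of what the NaN check and a native amp training step could look like. The model, optimizer, loss, and the train_step helper are hypothetical stand-ins, not the pix2pix code; only torch.cuda.amp.autocast and torch.cuda.amp.GradScaler are the actual native amp entry points. AMP is switched off automatically when no GPU is present so the snippet also runs on CPU.

```python
import torch
from torch import nn

# Enable mixed precision only when a GPU is available; on CPU both
# autocast and GradScaler are created disabled and act as pass-throughs.
use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

# Hypothetical stand-in for the real generator.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(inputs, targets):
    """Run one optimization step; return (loss value, whether a step was attempted)."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = model(inputs)
        loss = nn.functional.l1_loss(outputs, targets)

    # The NaN check: bail out before backward if the loss is already NaN/inf.
    if not torch.isfinite(loss).all():
        return loss.item(), False

    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # skips the update if gradients overflowed
    scaler.update()                # adjusts the loss scale for the next step
    return loss.item(), True
```

Calling train_step on the very first batch and looking at the returned flag shows whether the NaN is already present in the forward pass or only appears in the gradients.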

agt · May 10, 2020, 6:20pm

It does say NaN loss in the logs, but only with apex/amp. Training works fine without apex/amp. If I add torch.backends.cudnn.benchmark = False before training, it gives me a Zero Division error instead of the memory access error. Any idea what I can do? The pix2pix docs say it works.

ptrblck · May 11, 2020, 12:39am

Do you get the NaN output from the very first iteration or after a while?
Also, which dataset, PyTorch version, CUDA, and cudnn are you using?

agt · May 11, 2020, 1:34am

It happens right at the start.

Here’s the version info I get from:

print(torch.__version__)
nvcc --version
cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2

1.5.0+cu101
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"

The dataset is under 5k JPEG images. Hope this helps.
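
A side note on collecting this information: nvcc and /usr/include/cudnn.h describe the toolkit installed on the system, while the pip binaries generally ship their own CUDA runtime and cuDNN, so the two can differ. A small sketch that asks PyTorch directly is below; collect_versions is a hypothetical helper name, the torch calls inside it are the standard ones.

```python
import torch


def collect_versions():
    """Report the versions PyTorch itself was built with and is using."""
    cuda_ok = torch.cuda.is_available()
    return {
        "torch": torch.__version__,
        # CUDA version the binary was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        # Integer such as 7605 for cuDNN 7.6.5; None if cuDNN is unavailable.
        "cudnn": torch.backends.cudnn.version(),
        "gpu": torch.cuda.get_device_name(0) if cuda_ok else None,
    }


for key, value in collect_versions().items():
    print(f"{key}: {value}")
```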

agt · May 12, 2020, 1:45am

Did you have any ideas?

ptrblck · May 12, 2020, 4:35am

I’ve talked to the author of the repository and he isn’t aware of such an issue.
I’ll try to reproduce it with your setup and will get back to you.
I think it might take a day or two, so please ping me again in case I don’t update it here.

agt · May 12, 2020, 4:55am

Alright, thank you!

More info I didn’t include: I crop the images by setting resize_or_crop to scale_width_and_crop. Maybe that has something to do with it.

ptrblck · May 12, 2020, 5:03am

Could you run the original code in the meantime without any modifications to exclude this potential issue?

agt · May 13, 2020, 5:05am

Yes I’ll go do that! Sorry for the delay, my internet cut off when I read your message last night and I wasn’t able to proceed.

agt · May 13, 2020, 7:18am

I tested the original code with the simple label2city demo using apex and discovered a few things:

It runs for a few epochs but soon hits the same error. It prints “Gradient overflow. Skipping step, loss scaler 0 reducing loss scale…” around 10 times per epoch until the 7th epoch, where it gives me the same CUDA error: “an illegal memory access was encountered”.
It happens after 3 epochs if I set --resize_or_crop scale_width_and_crop. It scales the width by default, so I don’t know why cropping the training images would cause the error to happen more quickly.
No difference if I add --label_nc 0 --no_instance which tells it to just map images in train_A to train_B, like I do in my code.
If I combine A-to-B method with cropping using --resize_or_crop scale_width_and_crop --label_nc 0 --no_instance, I get a different error:

Traceback (most recent call last):
  File "train.py", line 85, in <module>
    with amp.scale_loss(loss_G, optimizer_G) as scaled_loss: scaled_loss.backward()                
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR (findAlgorithms at /pytorch/aten/src/ATen/native/cudnn/Conv.cpp:551)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f8016ad4536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf38e75 (0x7f8017e68e75 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xf2c07a (0x7f8017e5c07a in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xf2cd91 (0x7f8017e5cd91 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xf30dcb (0x7f8017e60dcb in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_input(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0xb2 (0x7f8017e61322 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf97e40 (0x7f8017ec7e40 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xfdc6d8 (0x7f8017f0c6d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x4fa (0x7f8017e629ba in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0xf9816b (0x7f8017ec816b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xfdc734 (0x7f8017f0c734 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x2c809b6 (0x7f80514629b6 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2cd0444 (0x7f80514b2444 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7f805107a918 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2d89c05 (0x7f805156bc05 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f8051568f03 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f8051569ce2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f8051562359 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f805dca1378 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0xbd6df (0x7f80686476df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #20: <unknown function> + 0x76db (0x7f80697296db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x3f (0x7f8069a6288f in /lib/x86_64-linux-gnu/libc.so.6)

That error can be resolved by changing the number of discriminators from the default of 2, by adding the option --num_D 1, --num_D 3, or --num_D 4. No idea why this is; I was just testing how the number of discriminators affects output quality in non-apex training.

So it happens even on the simple demo with the default settings. But my training images make it happen immediately, maybe because their demo only has 8 training samples while I have a few thousand. But even with only 8, it happens in a few seconds, after 3-10 epochs.

ptrblck · May 14, 2020, 5:46am

Thanks for the update!
The last error message is especially helpful.

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR (findAlgorithms...

indicates that cudnn cannot find a valid algorithm for the current setup (conv, input, datatype, device).
Could you install the PyTorch binaries with CUDA10.2 and cudnn7.6.5.32?
If you are still running into this error, try to use benchmark mode via torch.backends.cudnn.benchmark=True or disable cudnn via torch.backends.cudnn.enabled=False.

The former option will benchmark each new input shape (which will make this iteration slower) and will try to find the fastest algorithm for the current workload.
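
As a rough sketch, the two settings could be applied near the top of the training script like this. configure_cudnn is a hypothetical helper introduced only for illustration; the two torch.backends.cudnn attributes are the flags named above.

```python
import torch


def configure_cudnn(mode):
    """Apply one of the two workarounds: 'benchmark' or 'disabled'."""
    if mode == "benchmark":
        # Time the available algorithms for each new input shape and
        # reuse the fastest one for that shape afterwards.
        torch.backends.cudnn.enabled = True
        torch.backends.cudnn.benchmark = True
    elif mode == "disabled":
        # Bypass cuDNN entirely and fall back to PyTorch's own kernels.
        torch.backends.cudnn.enabled = False
    else:
        raise ValueError(f"unknown mode: {mode!r}")


# Try benchmark mode first; switch to "disabled" only if the error persists.
configure_cudnn("benchmark")
```

Disabling cuDNN is usually the slower of the two options, so it is best treated as a diagnostic step to see whether the error comes from cuDNN at all.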

agt · May 14, 2020, 7:31am

Do you think that error is the source of the illegal memory access errors in AMP training too? Why would it only happen when the number of discriminators is set to the default of 2? I’m happy to use 1, 3, or 4 discriminators, as long as it can train without the gradient overflow, NaN, and memory access errors.

I installed the latest version using !pip install torch===1.5.0 torchvision===0.6.0 -f https://download.pytorch.org/whl/torch_stable.html.

But now I get an error trying to install apex:
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-4rw655gw/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-4rw655gw/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-g9ds8p5h/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

And when I try to run the training script, I get a new error too:

Traceback (most recent call last):
  File "train.py", line 17, in <module>
    opt = TrainOptions().parse()
  File "/content/pix2pixHD/options/base_options.py", line 80, in parse
    torch.cuda.set_device(self.opt.gpu_ids[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 149, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 63, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError: 
The NVIDIA driver on your system is too old (found version 10010).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--fp16', '--resize_or_crop', 'scale_width_and_crop', '--name', 'label2city_512p', '--label_nc', '0', '--no_instance']' returned non-zero exit status 1.

Here’s a colab: https://colab.research.google.com/drive/1ed3lxHDORKkubREnIbnRiWqbghDF5Kju?authuser=5#scrollTo=vzISEGB8Da_H
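
A note on the "found version 10010" in the assertion above: that number is the CUDA version supported by the installed driver, encoded as 1000 * major + 10 * minor, so 10010 means CUDA 10.1. The sketch below decodes it and compares it with the CUDA version a binary was built for. The 10020 value (CUDA 10.2) is an assumption about the build that the pip command picked up, not something shown in the logs, and both function names are hypothetical.

```python
def decode_cuda_version(encoded):
    """Split an integer such as 10010 into a (major, minor) tuple."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_can_run(driver_encoded, runtime_encoded):
    """A driver can run binaries built for its own CUDA version or older."""
    return decode_cuda_version(driver_encoded) >= decode_cuda_version(runtime_encoded)


driver = 10010   # from the assertion message
runtime = 10020  # assumption: the newly installed binary targets CUDA 10.2

print(decode_cuda_version(driver))      # (10, 1)
print(driver_can_run(driver, runtime))  # False
```

If that assumption holds, either the driver has to be updated or a build matching CUDA 10.1, like the 1.5.0+cu101 reported earlier in the thread, has to be installed again.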

ptrblck · May 14, 2020, 7:36pm

There are a couple of different issues here:

The illegal memory access should never happen, and we are looking into it.
Gradient overflow is expected, as long as it doesn’t happen in every iteration; the loss scaler reduces its scaling factor whenever it occurs.
A NaN output should not happen; it might point to an issue in the model architecture or a bug in amp.
We recommend installing the nightly binaries, which come with native amp via torch.cuda.amp, instead of building apex/amp.
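
To illustrate the difference between occasional and constant gradient overflow, here is a toy simulation of dynamic loss scaling in plain Python. It is not the apex or torch.cuda.amp implementation; ToyLossScaler, the initial scale of 2**16, and the growth interval of 2000 are assumptions chosen for illustration. Finite gradients cause a few skipped steps until the scale fits into the float16 range, while NaN gradients make the scale halve on every step, which is the pattern in the log at the top of the thread.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def to_fp16_range(value):
    """Mimic float16 overflow: values beyond the range become +/-inf."""
    if math.isnan(value):
        return value
    if abs(value) > FP16_MAX:
        return math.copysign(math.inf, value)
    return value


class ToyLossScaler:
    """Toy dynamic loss scaler (illustration only)."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def step(self, raw_grads):
        """Return True if the optimizer step would run, False if it is skipped."""
        scaled = [to_fp16_range(g * self.scale) for g in raw_grads]
        if not all(math.isfinite(g) for g in scaled):
            # Overflow: skip the step and shrink the scale.
            self.scale /= self.factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            # Long run without overflow: try a larger scale again.
            self.scale *= self.factor
            self.clean_steps = 0
        return True


healthy = ToyLossScaler()
healthy_results = [healthy.step([10.0]) for _ in range(10)]
print(healthy_results.count(False), healthy.scale)  # 4 4096.0

broken = ToyLossScaler()
broken_results = [broken.step([float("nan")]) for _ in range(20)]
print(broken_results.count(False), broken.scale)    # 20 0.0625
```

In the healthy case the scale settles after a handful of skipped steps. In the broken case no scale can ever help, because the NaN comes from the model or the loss and not from the scaling.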