CUDA error: an illegal memory access was encountered. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions

I am training/finetuning a Llama-2-7B model on 2 A100 80GB GPUs with a custom RL algorithm. I am using a PyTorch nightly build with CUDA 11.8 and FSDP, based on GitHub - facebookresearch/llama-recipes: Examples and recipes for Llama 2 model. The model trains for a couple of steps and then crashes with the following stack trace:
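For context, the FSDP wrapping follows the llama-recipes pattern; below is a minimal sketch of what my setup roughly looks like (module and variable names are illustrative, not my exact script):

import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import LlamaForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative path

# Wrap each decoder layer as its own FSDP unit, as in llama-recipes.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)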

[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b6b6826f647 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x2b6b6822b8f9 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x2b6b68139588 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x2b6b156c3b90 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x2b6b156c79b8 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x2b6b156de1db in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x2b6b156de4e8 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xb94cf (0x2b6ac83184cf in /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x7ea5 (0x2b6abd860ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x2b6abe27cb0d in /lib64/libc.so.6)

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b6b6826f647 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x2b6b6822b8f9 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x2b6b68139588 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x2b6b156c3b90 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x2b6b156c79b8 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x2b6b156de1db in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x2b6b156de4e8 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xb94cf (0x2b6ac83184cf in /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x7ea5 (0x2b6abd860ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x2b6abe27cb0d in /lib64/libc.so.6)

Fatal Python error: Aborted

Things I have tried (based on @ptrblck's suggestions in other, similar posts):

  • Running the model with CUDA_LAUNCH_BLOCKING=1; the stack trace remains identical.
  • Running under compute-sanitizer, which did not help either.
  • Setting other debug environment variables such as NCCL_DEBUG=INFO (set roughly as in the snippet at the end of this post); the trace stays the same.

Is there anything else I can do to debug this and find the error? I also print intermediate information from my code to the terminal, and both of my processes seem to be running the model independently:
generated samples
generated samples
att_mask device: cuda:1
att_mask device: cuda:0
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0][1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

tensor(0., device='cuda:1', grad_fn=<AddBackward0>)
tensor(0., device='cuda:1', grad_fn=<AddBackward0>)
Calculated RL loss
tensor(6.6880, device='cuda:0', grad_fn=<AddBackward0>)
tensor(6.6880, device='cuda:0', grad_fn=<AddBackward0>)

This is something I did not expect FSDP to do, since it shards the model across the 2 processes rather than replicating it separately on each GPU. Am I missing some synchronization code in my script? Any ideas for tracking this down further would be helpful.
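For reference, this is roughly how I set the debug environment variables mentioned above (they have to be in place before the process first touches CUDA / NCCL):

import os

# Set before the first CUDA call / NCCL init so they actually take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches
os.environ["NCCL_DEBUG"] = "INFO"         # verbose NCCL logging

import torch  # imported only after the variables are set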

Thanks!

Are you using any custom layers in your model (e.g., via CUDA extensions)? This looks like the NCCL watchdog surfacing a sticky failure (such as an illegal memory access) produced by some layer in the model. Does your model take varying inputs (in terms of shape or some other property)? A failure after a few iterations can happen if, e.g., a layer does not handle a specific shape or alignment properly.

I am not using any custom layers; it's just the stock Llama-2 model from Hugging Face transformers. My model does take inputs with varying shapes, but only across different batches; within a given batch I ensure the shapes are consistent.
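To illustrate what I mean by consistent shapes within a batch, the collation is roughly this (a simplified sketch with placeholder names, not my exact code):

import torch

def collate(samples, pad_id, device):
    # Pad every sequence to the longest one in this batch, so shapes only
    # vary across batches, never within a single batch.
    max_len = max(len(s) for s in samples)
    input_ids = torch.full((len(samples), max_len), pad_id, dtype=torch.long)
    att_mask = torch.zeros(len(samples), max_len, dtype=torch.long)
    for i, s in enumerate(samples):
        input_ids[i, : len(s)] = torch.tensor(s, dtype=torch.long)
        att_mask[i, : len(s)] = 1
    return input_ids.to(device), att_mask.to(device)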

Right, could you check if a specific input batch shape triggers the error?
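You could, e.g., log the shape of every batch right before it is used; with blocking launches, the last shape printed before the crash should point at the failing batch. Something along these lines (dataloader, rank, model, and compute_rl_loss are placeholders for your own objects; untested sketch):

for step, (input_ids, att_mask) in enumerate(dataloader):  # placeholder objects
    # The last line printed before the crash corresponds to the failing batch.
    print(f"rank {rank} step {step} input shape {tuple(input_ids.shape)}", flush=True)
    train_loss = compute_rl_loss(model, input_ids, att_mask)  # your loss computation
    train_loss.backward()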

I am still looking into the batches individually, but in the last run I got an error with slightly more information. The stack trace looks like this:

Traceback (most recent call last):
  File "/cluster/project/sachan/kushal/llama-exp/llama_rl_train.py", line 116, in train
    train_loss.backward()
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/utils/checkpoint.py", line 1071, in unpack_hook
    frame.recompute_fn(*args)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/utils/checkpoint.py", line 1194, in recompute_fn
    fn(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    query_states = self.q_proj(hidden_states)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: setStorage: sizes [4096, 4096], strides [1, 4096], storage offset 0, and itemsize 2 requiring a storage size of 33554432 are out of bounds for storage of size 0
Exception raised from checkInBoundsForStorage at ../aten/src/ATen/native/Resize.h:92 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b8cc8668647 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x2b8cc86248f9 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1f78319 (0x2b8c5eea2319 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>) + 0x104 (0x2b8c5ee99ee4 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x31883a5 (0x2b8c7790f3a5 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x31bff93 (0x2b8c77946f93 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) + 0x1e6 (0x2b8c5f35e4f6 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::Tensor::as_strided_symint(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) const + 0x4a (0x2b8c5eea110a in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::transpose(at::Tensor const&, long, long) + 0x81b (0x2b8c5ee96adb in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x2c81b23 (0x2b8c5fbabb23 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::_ops::transpose_int::call(at::Tensor const&, long, long) + 0x15f (0x2b8c5f824e7f in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::t(at::Tensor const&) + 0x4b (0x2b8c5ee779bb in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2c81aed (0x2b8c5fbabaed in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&) + 0x6b (0x2b8c5f7b0c7b in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x4a0a7d0 (0x2b8c619347d0 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x4a0a9e0 (0x2b8c619349e0 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&) + 0x6b (0x2b8c5f7b0c7b in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x43b0d79 (0x2b8c612dad79 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x43b1220 (0x2b8c612db220 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: at::_ops::t::call(at::Tensor const&) + 0x12b (0x2b8c5f7f580b in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #20: at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&) + 0x230 (0x2b8c5ec28b20 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x2e54ca3 (0x2b8c5fd7eca3 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #22: at::_ops::linear::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&) + 0x18f (0x2b8c5f38551f in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0x67befa (0x2b8c5c319efa in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_python.so)

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Let me know if this is helpful in any way for tracking down the issue.
Thanks!

Cross-post from here. I would recommend sticking to one thread instead of creating multiple ones, as different users might otherwise end up re-debugging the same issue.