RuntimeError: setStorage: sizes [4096, 4096], strides [1, 4096], storage offset 0, and itemsize 2 requiring a storage size of 33554432 are out of bounds for storage of size 0

I am fine-tuning llama-2-7b using FSDP on two 80 GB A100 GPUs. I am using the PyTorch nightly build. My model runs for a couple of steps and then crashes with the following stack trace:

Traceback (most recent call last):
  File "/cluster/project/sachan/kushal/llama-exp/llama_rl_train.py", line 116, in train
    train_loss.backward()
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/utils/checkpoint.py", line 1071, in unpack_hook
    frame.recompute_fn(*args)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/utils/checkpoint.py", line 1194, in recompute_fn
    fn(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    query_states = self.q_proj(hidden_states)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: setStorage: sizes [4096, 4096], strides [1, 4096], storage offset 0, and itemsize 2 requiring a storage size of 33554432 are out of bounds for storage of size 0
Exception raised from checkInBoundsForStorage at ../aten/src/ATen/native/Resize.h:92 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b8cc8668647 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x2b8cc86248f9 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1f78319 (0x2b8c5eea2319 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>) + 0x104 (0x2b8c5ee99ee4 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x31883a5 (0x2b8c7790f3a5 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x31bff93 (0x2b8c77946f93 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) + 0x1e6 (0x2b8c5f35e4f6 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::Tensor::as_strided_symint(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) const + 0x4a (0x2b8c5eea110a in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::transpose(at::Tensor const&, long, long) + 0x81b (0x2b8c5ee96adb in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x2c81b23 (0x2b8c5fbabb23 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::_ops::transpose_int::call(at::Tensor const&, long, long) + 0x15f (0x2b8c5f824e7f in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::t(at::Tensor const&) + 0x4b (0x2b8c5ee779bb in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2c81aed (0x2b8c5fbabaed in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&) + 0x6b (0x2b8c5f7b0c7b in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x4a0a7d0 (0x2b8c619347d0 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x4a0a9e0 (0x2b8c619349e0 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&) + 0x6b (0x2b8c5f7b0c7b in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x43b0d79 (0x2b8c612dad79 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x43b1220 (0x2b8c612db220 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: at::_ops::t::call(at::Tensor const&) + 0x12b (0x2b8c5f7f580b in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #20: at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&) + 0x230 (0x2b8c5ec28b20 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x2e54ca3 (0x2b8c5fd7eca3 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #22: at::_ops::linear::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&) + 0x18f (0x2b8c5f38551f in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0x67befa (0x2b8c5c319efa in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_python.so)

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

With PyTorch 2.0.1+cu118 the model behaves differently and fails with a different error (possibly with a similar root cause), as detailed here. I have tried running with compute-sanitizer and CUDA_LAUNCH_BLOCKING, but the stack traces remain identical. Any ideas for debugging this further would be helpful.
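
A side note on reading the first error message: the 33554432 it reports is the number of bytes that a view with those sizes, strides, offset, and itemsize needs to address. The sketch below reproduces that arithmetic in plain Python; `required_storage_bytes` is a hypothetical helper written for illustration, not a PyTorch function. The strides [1, 4096] describe a transposed view of a contiguous 4096 x 4096 half-precision weight, which matches the `t` call inside `linear` in the C++ frames. A storage of size 0 means the weight had no backing memory at that moment. One possible explanation, stated here only as an assumption, is that the FSDP-sharded parameter was not unsharded when the checkpointed forward was recomputed.

```python
def required_storage_bytes(sizes, strides, storage_offset, itemsize):
    """Bytes a strided view needs: (index of last addressed element + 1) * itemsize."""
    if any(n == 0 for n in sizes):
        return 0  # an empty view addresses no memory
    last_index = storage_offset + sum((n - 1) * st for n, st in zip(sizes, strides))
    return (last_index + 1) * itemsize


# Values copied from the error message.
needed = required_storage_bytes([4096, 4096], [1, 4096], 0, 2)
print(needed)          # 33554432
print(needed / 2**20)  # 32.0 (MiB)
```

So the view expects a 32 MiB weight, while the storage it points at holds 0 bytes.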

compute-sanitizer should be able to detect the kernel causing the illegal memory access. If you get stuck, post a minimal and executable code snippet to reproduce the issue.

Hi, sorry for creating multiple threads. What should the output of compute-sanitizer look like? I am currently running all my scripts with it, but the errors and stack traces do not seem to change. Meanwhile, I changed some inputs to my model to narrow down the possibilities and I get a different error (although the root cause seems to be the same):

 File "/cluster/project/sachan/kushal/llama-exp/llama_rl_train.py", line 280, in rl_sampling
    generated_idx = model.generate(
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 166, in record_pre_forward
    self._check_order(handle, is_training)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/peft/peft_model.py", line 1002, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 267, in _check_order
    raise RuntimeError(
  File "/cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
RuntimeError: Forward order differs across ranks: rank 0 is all-gathering parameters for [['base_model.model.model.layers.0.self_attn.q_proj.weight', 'base_model.model.model.layers.0.self_attn.k_proj.weight', 'base_model.model.model.layers.0.self_attn.v_proj.weight', 'base_model.model.model.layers.0.self_attn.o_proj.weight', 'base_model.model.model.layers.0.mlp.gate_proj.weight', 'base_model.model.model.layers.0.mlp.up_proj.weight', 'base_model.model.model.layers.0.mlp.down_proj.weight', 'base_model.model.model.layers.0.input_layernorm.weight', 'base_model.model.model.layers.0.post_attention_layernorm.weight']] while rank 1 is all-gathering parameters for [['base_model.model.model.embed_tokens.weight', 'base_model.model.model.norm.weight', 'base_model.model.lm_head.weight']]
  

Can this error help in getting at the root cause?
I am trying to create a reproducible example but it’s proving to be a bit tricky as the error occurs in a distributed setting only with the given dataset/inputs.

The new error might be caused if your forward functions are not executing the same code path. Could this be the case for your model?
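
For context, under FSDP every rank runs its own forward pass on its own batch (the parameters are sharded, the computation is not), so all ranks have to all-gather the same units in the same order. The toy sketch below uses plain Python with no FSDP involved, and shows how identical forward code can still take different paths on different ranks when a branch depends on the input. The unit names, the `extra_adapter` layer, and both helper functions are made up for illustration.

```python
from itertools import zip_longest


def forward_order(has_extra_input):
    """Toy forward pass: returns the names of the units it runs, in order."""
    order = ["embed_tokens"]
    if has_extra_input:  # data-dependent branch: same code, different path
        order.append("extra_adapter")  # hypothetical optional layer
    order += ["layers.0", "layers.1", "lm_head"]
    return order


def first_divergence(order_a, order_b):
    """Return (position, unit_a, unit_b) at the first mismatch, or None if identical."""
    for pos, (a, b) in enumerate(zip_longest(order_a, order_b)):
        if a != b:
            return pos, a, b
    return None


rank0 = forward_order(has_extra_input=True)   # this rank's batch takes the branch
rank1 = forward_order(has_extra_input=False)  # this rank's batch skips it
print(first_divergence(rank0, rank1))  # (1, 'extra_adapter', 'layers.0')
```

In a real run, a similar comparison could be made by logging the module call order on each rank (for example with `register_forward_pre_hook`) and diffing the per-rank logs.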

Do you mean that the forward functions are different across ranks? That’s not the case; the forward functions are executing the same code. Related to this: my understanding is that with FSDP my model is sharded across the processes, so there should be only one call to forward at a time. In my run, it seems that both processes are running the forward pass, as with DDP. Is my understanding correct?
EDIT: I had some additional questions:

  1. I thought compute-sanitizer was working for me, but it turns out it is timing out. I am seeing this message:
========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.

I did not find anything relevant to solve this issue.
  2. I am currently using it like this. Please let me know if this is the correct usage.

compute-sanitizer torchrun --nnodes 1 --nproc-per-node 2 train.py
  3. Does TORCH_CUDA_SANITIZER perform a similar function to compute-sanitizer?

Thanks a lot for your time!
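
A hedged sketch related to the questions above. The "No attachable process found" message may appear because `torchrun` itself is only a launcher and the CUDA work happens in its child processes; compute-sanitizer has a `--target-processes all` option for tracking child processes, and the `--launch-timeout` flag is the one named in the message. The helper names, the timeout value, and the script name below are assumptions for illustration. As far as I know, `TORCH_CUDA_SANITIZER` enables PyTorch's own CUDA Sanitizer, which targets synchronization errors between streams rather than out-of-bounds memory accesses, so the two tools are not interchangeable.

```python
import os
import shlex


def sanitizer_command(script="train.py", nproc_per_node=2, launch_timeout_s=300):
    """Build a compute-sanitizer invocation that also tracks torchrun's worker processes."""
    return [
        "compute-sanitizer",
        "--target-processes", "all",                # follow child processes, not only the launcher
        "--launch-timeout", str(launch_timeout_s),  # assumed value; tune as needed
        "torchrun", "--nnodes", "1", "--nproc-per-node", str(nproc_per_node), script,
    ]


def debug_env(base_env=None):
    """Copy the environment and make CUDA kernel launches synchronous for clearer traces."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # inherited by the torchrun workers
    return env


print(shlex.join(sanitizer_command()))
```

The command could then be launched with `subprocess.run(sanitizer_command(), env=debug_env())`. Expect the run to be much slower than normal.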

Is the issue solved? I’m facing the exact same problem :zipper_mouth_face:

I am also facing the same issue while running fairseq with torch. Has anyone solved this issue yet?
cc @ptrblck @KUSHAL_JAIN

@KUSHAL_JAIN

I’m also facing a similar error. Like you said, it can be reproduced, but it is very tricky to write a small code snippet that does so. My project is a Llava-style decoder using llama2 7b. Did you make any progress in debugging it?

compute-sanitizer gets stuck forever, and running with TORCH_CUDA_SANITIZER=1 is super slow; I haven’t reached the failing step yet =)

[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 6] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c58b63d87 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2c58b1475f in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f2c58c348a8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::startedGPUExecutionInternal() const + 0x7e (0x7f2c59d072ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isStarted() + 0x58 (0x7f2c59d0b458 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x303 (0x7f2c59d0eda3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2c59d0f839 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7f2ca9963df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f2cae286609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f2cae3c0133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Did you ever figure this out? I’m struggling with the exact same error myself, but it is hard to reproduce. It just happens randomly during my training runs.

Yes, that is the case for me. I am using gradient checkpointing and also have a custom additional layer for the llama decoder. Otherwise, I have the same issue as discussed in this chat. Is there something I can try?