Hi. I am fine-tuning a llama-7b model on 4 A100 GPUs, using FSDP and LoRA (PEFT), based on the llama-recipes library (GitHub - facebookresearch/llama-recipes: Examples and recipes for Llama 2 model).
Torch version: 2.2.0+cu118
My goal is to compute the "per-token gradient" of each parameter for analysis purposes.
Here’s how I’ve modified the training code:
- I modified the model code to return the non-averaged loss by setting reduction='none' in the CrossEntropyLoss (see the sketch after this list).
- Instead of calling backward on the averaged loss, I now call backward on the loss of each individual token.
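For reference, the first change looks roughly like this inside the model's loss computation (a simplified sketch of the idea, assuming the usual logits/labels shift in the Llama forward pass, not the exact llama-recipes code):

import torch.nn as nn

# With reduction='none' the loss is a 1-D tensor with one entry per target token
# instead of a single averaged scalar.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss_fct = nn.CrossEntropyLoss(reduction='none')
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
# loss now has shape (batch_size * (seq_len - 1),)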
In the code below, I changed this section:
to:
loss[0].backward(retain_graph=True)  # backward for the first token's loss
optimizer.zero_grad()
loss[1].backward()                   # backward for the second token's loss
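The two lines above are just the first two tokens; the full loop I am trying to run is roughly the following (updated_parameters is my list of trainable LoRA parameters; per_token_grads is just an illustrative name):

per_token_grads = []
for i in range(loss.numel()):
    # retain_graph is needed on every call except the last, since the same graph is reused
    loss[i].backward(retain_graph=(i < loss.numel() - 1))
    # store a copy of each parameter's gradient for this token, then reset
    per_token_grads.append([p.grad.detach().clone() if p.grad is not None else None
                            for p in updated_parameters])
    optimizer.zero_grad()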
However, this modification results in an error:
Traceback (most recent call last):
File ".../lib/python3.8/site-packages/torch/autograd/__init__.py", line 411, in grad
result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: setStorage: sizes [32000, 4096], strides [4096, 1], storage offset 131076096, and itemsize 2 requiring a storage size of 524296192 are out of bounds for storage of size 0
I also attempted torch.autograd.grad as follows:
grad1 = torch.autograd.grad(loss[0], updated_parameters, retain_graph=True, allow_unused=True)
optimizer.zero_grad()
grad2 = torch.autograd.grad(loss[1], updated_parameters, allow_unused=True)
but it led to the same error.
Since I'm using FSDP and LoRA together, this seems to be a complex issue.
One possible workaround might be to run the forward computation for each token individually instead of once over the entire sequence, but this could be very time-consuming (a rough sketch of what I mean is below).
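Concretely, I mean something like this (a rough sketch using my existing model, batch, optimizer, and updated_parameters; one fresh forward pass, and therefore one fresh autograd graph, per token):

per_token_grads = []
num_tokens = model(**batch).loss.numel()   # per-token losses thanks to reduction='none'
for i in range(num_tokens):
    optimizer.zero_grad()
    loss = model(**batch).loss             # redo the forward so each backward has its own graph
    loss[i].backward()                     # no retain_graph needed
    per_token_grads.append([p.grad.detach().clone() if p.grad is not None else None
                            for p in updated_parameters])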
Additionally, I’m curious about the cause of this problem.
Does anyone have any insights?