whatdhack
(Whatdhack)
February 16, 2024, 3:03am
1
Getting the following error with an HF “gpt2-medium” model and 2 MPI ranks. It looks like the gradients are not being sharded - 25731584 is half of 51463168? How do I go about fixing this?
Function SplitWithSizesBackward0 returned an invalid gradient at index 0 - got [51463168] but expected shape compatible with [25731584]
The PyTorch version is 2.0.1 and CUDA is 11.8.
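For context, the two sizes in the error differ by exactly a factor of two, which is what you would see if one side is the full flat parameter and the other is a per-rank FSDP shard. A rough sketch of that arithmetic (only the two sizes come from the error above; everything else is illustrative):

import torch

# Hypothetical illustration of the mismatch: a flat parameter of 51463168
# elements sharded across 2 ranks leaves each rank a 25731584-element shard.
# If the backward pass then hands back a full-size gradient, the autograd
# engine sees [51463168] where it expected [25731584].
world_size = 2
flat_param = torch.empty(51463168)
local_shard = flat_param.chunk(world_size)[0]   # rank 0's shard

print(flat_param.numel())    # 51463168 - size reported in the gradient
print(local_shard.numel())   # 25731584 - size the engine expected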
whatdhack
(Whatdhack)
February 27, 2024, 4:06am
2
Appears to be coming from here:
at::Tensor InputMetadata::reduce_grad(at::Tensor& grad) const {
  // reduce_grad should only be called if is_expandable_to_shape returns true.
  TORCH_INTERNAL_ASSERT(maybe_expandable_to(grad));
  return at::sum_to(std::move(grad), shape_as_dim_vector());
}

std::stringstream InputMetadata::incompatible_shape_error_message(
    const size_t index,
    const at::Tensor& grad) const {
  std::stringstream ss{};
  ss << "invalid gradient at index " << index << " - got ";
  if (::torch::autograd::is_cpp_nested_tensor(grad)) {
    ss << grad._nested_tensor_size();
  } else {
    ss << grad.sym_sizes();
  }
  ss << " but expected shape compatible with ";
  if (is_cpp_nested_tensor()) {
    ss << shape_as_tensor();
  } else {
    ss << shape_as_dim_vector();
  }
  return ss;
}
and is reached through here:
inputs = (inputs,) if isinstance(inputs, torch.Tensor) else \
    tuple(inputs) if inputs is not None else tuple()

grad_tensors_ = _tensor_or_tensors_to_tuple(grad_tensors, len(tensors))
grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
if retain_graph is None:
    retain_graph = create_graph

# The reason we repeat the same comment below is that
# some Python versions print out the first line of a multi-line function
# calls in the traceback and some print out the last line
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    tensors, grad_tensors_, retain_graph, create_graph, inputs,
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
def grad(
    outputs: _TensorOrTensors,
    inputs: _TensorOrTensors,
    grad_outputs: Optional[_TensorOrTensors] = None,
    retain_graph: Optional[bool] = None,
    create_graph: bool = False,
    only_inputs: bool = True,
    allow_unused: bool = False,
    is_grads_batched: bool = False
) -> Tuple[torch.Tensor, ...]:
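For anyone else tracing this: the C++ check above fires whenever a backward function hands the engine a gradient whose shape is not expandable to the corresponding input's shape. A minimal, FSDP-free sketch that trips the same message (the BadGrad class and the sizes are made up for illustration):

import torch

class BadGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        # Deliberately return a gradient with twice as many elements as the input.
        return torch.ones(grad_out.numel() * 2)

x = torch.randn(4, requires_grad=True)
BadGrad.apply(x).sum().backward()
# RuntimeError: Function BadGradBackward returned an invalid gradient at index 0
# - got [8] but expected shape compatible with [4]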
whatdhack
(Whatdhack)
March 1, 2024, 4:03am
3
This is caused by the HF Accelerate package. HF Accelerate requires setting ACCELERATE_USE_FSDP=true for FSDP. Without it, I think PyTorch expects the gradients in one (sharded) shape while the model delivers them in another.
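For completeness, a sketch of what that looks like when the script is launched directly (e.g. with mpirun or torchrun) instead of through accelerate launch, which normally sets the variable itself when FSDP is configured; the flag has to be in the environment before the Accelerator is created:

import os

# Assumption: the training script is started with mpirun/torchrun, so the
# variable is exported here rather than by `accelerate launch`.
os.environ["ACCELERATE_USE_FSDP"] = "true"

from accelerate import Accelerator

accelerator = Accelerator()
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)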