whatdhack
(Whatdhack)
February 16, 2024, 3:03am
1
Getting the following error with an HF “gpt2-medium” model and 2 MPI ranks. It looks like the gradients are not being sharded - 25731584 is half of 51463168? How do I go about fixing this?
Function SplitWithSizesBackward0 returned an invalid gradient at index 0 - got [51463168] but expected shape compatible with [25731584]
The PyTorch version is 2.0.1 and CUDA is 11.8.
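For context, the two sizes in the error differ by exactly a factor of two, which is what you would see if one side is the full flat parameter and the other is a per-rank FSDP shard. A rough sketch of that arithmetic (only the two sizes come from the error above; everything else is illustrative):

import torch

# Hypothetical illustration of the mismatch: a flat parameter of 51463168
# elements sharded across 2 ranks leaves each rank a 25731584-element shard.
# If the backward pass then hands back a full-size gradient, the autograd
# engine sees [51463168] where it expected [25731584].
world_size = 2
flat_param = torch.empty(51463168)
local_shard = flat_param.chunk(world_size)[0]   # rank 0's shard

print(flat_param.numel())    # 51463168 - size reported in the gradient
print(local_shard.numel())   # 25731584 - size the engine expected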
whatdhack
(Whatdhack)
February 27, 2024, 4:06am
2
Appears to be coming from here:
at::Tensor InputMetadata::reduce_grad(at::Tensor& grad) const {
  // reduce_grad should only be called if is_expandable_to_shape returns true.
  TORCH_INTERNAL_ASSERT(maybe_expandable_to(grad));
  return at::sum_to(std::move(grad), shape_as_dim_vector());
}

std::stringstream InputMetadata::incompatible_shape_error_message(
    const size_t index,
    const at::Tensor& grad) const {
  std::stringstream ss{};
  ss << "invalid gradient at index " << index << " - got ";
  if (::torch::autograd::is_cpp_nested_tensor(grad)) {
    ss << grad._nested_tensor_size();
  } else {
    ss << grad.sym_sizes();
  }
  ss << " but expected shape compatible with ";
  if (is_cpp_nested_tensor()) {
    ss << shape_as_tensor();
  } else {
    ss << shape_as_dim_vector();
  }
  return ss;
}
and is reached through here:
inputs = (inputs,) if isinstance(inputs, torch.Tensor) else \
    tuple(inputs) if inputs is not None else tuple()

grad_tensors_ = _tensor_or_tensors_to_tuple(grad_tensors, len(tensors))
grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
if retain_graph is None:
    retain_graph = create_graph

# The reason we repeat the same comment below is that
# some Python versions print out the first line of a multi-line function
# calls in the traceback and some print out the last line
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    tensors, grad_tensors_, retain_graph, create_graph, inputs,
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
def grad(
    outputs: _TensorOrTensors,
    inputs: _TensorOrTensors,
    grad_outputs: Optional[_TensorOrTensors] = None,
    retain_graph: Optional[bool] = None,
    create_graph: bool = False,
    only_inputs: bool = True,
    allow_unused: bool = False,
    is_grads_batched: bool = False
) -> Tuple[torch.Tensor, ...]:
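For anyone else tracing this: the C++ check above fires whenever a backward function hands the engine a gradient whose shape is not expandable to the corresponding input's shape. A minimal, FSDP-free sketch that trips the same message (the BadGrad class and the sizes are made up for illustration):

import torch

class BadGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        # Deliberately return a gradient with twice as many elements as the input.
        return torch.ones(grad_out.numel() * 2)

x = torch.randn(4, requires_grad=True)
BadGrad.apply(x).sum().backward()
# RuntimeError: Function BadGradBackward returned an invalid gradient at index 0
# - got [8] but expected shape compatible with [4]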
whatdhack
(Whatdhack)
March 1, 2024, 4:03am
3
This is caused by the HF Accelerate package. HF Accelerate requires setting ACCELERATE_USE_FSDP=true for FSDP. Without it, I think PyTorch expects the gradients in one (sharded) shape while the model delivers them in another.
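For completeness, a sketch of what that looks like when the script is launched directly (e.g. with mpirun or torchrun) instead of through accelerate launch, which normally sets the variable itself when FSDP is configured; the flag has to be in the environment before the Accelerator is created:

import os

# Assumption: the training script is started with mpirun/torchrun, so the
# variable is exported here rather than by `accelerate launch`.
os.environ["ACCELERATE_USE_FSDP"] = "true"

from accelerate import Accelerator

accelerator = Accelerator()
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)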