Invalid gradient at index 0 with FSDP (gpt-model)

I get the following error with an HF “gpt2-medium” model on 2 MPI ranks. It looks like the gradients are not sharded: 25731584 is half of 51463168. How do I go about fixing this?

Function SplitWithSizesBackward0 returned an invalid gradient at index 0 - got [51463168] but expected shape compatible with [25731584]
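
As a quick sanity check on those two numbers, here is a minimal sketch. It assumes the flat parameter is split evenly across ranks with no padding, which is my reading of the error rather than something confirmed from the FSDP source; the variable names are only for illustration.

```python
# Sizes copied from the error message; world_size matches the 2 MPI ranks.
full_numel = 51463168      # shape the backward pass actually produced
expected_numel = 25731584  # shape autograd expected
world_size = 2

# Assumption: an even split across ranks, with no padding elements.
shard_numel, remainder = divmod(full_numel, world_size)

# If this prints True and a remainder of 0, the expected shape is exactly one
# rank's shard and the returned gradient has the full, unsharded size.
print(shard_numel == expected_numel, remainder)
```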

The PyTorch version is 2.0.1 and CUDA is 11.8.

It appears to be coming from here, through here.

This is caused by the HF Accelerate package. HF Accelerate requires setting ACCELERATE_USE_FSDP=true for FSDP. Without that, I think PyTorch expects a gradient of one shape (the sharded size) while the model provides another (the full size).
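
For anyone who wants to set the flag from inside the training script rather than in the shell, a minimal sketch follows. The helper name `enable_accelerate_fsdp` is hypothetical, and I am assuming the variable has to be set in every rank's process before Accelerate (or the HF Trainer) is initialized; I have not verified the exact point at which it is read.

```python
import os


def enable_accelerate_fsdp() -> str:
    """Hypothetical helper: set the Accelerate FSDP flag for this process."""
    # Assumption: this runs before any Accelerate / Trainer object is created.
    os.environ["ACCELERATE_USE_FSDP"] = "true"
    return os.environ["ACCELERATE_USE_FSDP"]


# Call at the very top of the training script, on every rank.
flag = enable_accelerate_fsdp()
print("ACCELERATE_USE_FSDP =", flag)
```

Exporting the variable in the environment of the launch command should have the same effect, as long as the launcher forwards it to every rank.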