Hi,
While fine-tuning a sliding-window-attention transformer with FSDP2 + context_parallel via Hugging Face accelerate,
I noticed that training speed improved significantly, but model quality dropped.
After checking the source (_attention.py#L688),
it seems that attention masks are not yet supported for context parallel.
I suspect the quality drop is caused by the sliding-window attention mask not being applied during CP execution.
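
For clarity, here is a minimal sketch of the kind of mask I mean, using plain PyTorch SDPA outside the CP path. The helper name, window size, and tensor shapes are illustrative, not taken from my actual setup:

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int, device=None) -> torch.Tensor:
    # Boolean mask, True = attend: each query position i sees keys in
    # (i - window, i], i.e. causal attention limited to the last `window` keys.
    idx = torch.arange(seq_len, device=device)
    rel = idx[None, :] - idx[:, None]  # key_pos - query_pos
    return (rel <= 0) & (rel > -window)

q = k = v = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq, head_dim)
mask = sliding_window_mask(1024, window=256)

# With the mask: sliding-window attention, as the model was trained to expect.
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Without the mask (what I suspect happens under CP): unrestricted attention,
# which silently changes the model's effective receptive field.
out_unmasked = F.scaled_dot_product_attention(q, k, v)
```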
Could you share whether:
- attention mask support for context_parallel is planned, and
- there’s any ETA or related PR for this feature?
Thanks!