Context Parallel – attention mask not supported yet? Model quality drop in FSDP2 + CP fine-tuning

Hi,

While fine-tuning a sliding-window transformer with FSDP2 + context_parallel via Hugging Face Accelerate,
I noticed that training throughput improved significantly — but model quality dropped.
After checking the source (_attention.py#L688),
it seems that attention masks are not yet supported under context parallel.
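For context, here is a minimal sketch of what I believe is happening under the hood. My actual run goes through Accelerate, but the underlying mechanism should be PyTorch's experimental context_parallel API; the mesh size, batch, and model call below are illustrative placeholders, not my exact code:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel

# Run under torchrun; the 8-way CP mesh is a placeholder for the real topology.
mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("cp",))

# Placeholder batch; in practice this comes from the DataLoader.
batch = {
    "input_ids": torch.randint(0, 32000, (1, 4096), device="cuda"),
    "labels": torch.randint(0, 32000, (1, 4096), device="cuda"),
}

# context_parallel() shards the listed buffers along their sequence dim
# (dim 1 here) across the mesh and swaps scaled_dot_product_attention
# for a ring-attention variant for the duration of the block.
with context_parallel(
    mesh,
    buffers=[batch["input_ids"], batch["labels"]],
    buffer_seq_dims=[1, 1],
):
    loss = model(**batch).loss  # `model` = the sliding-window transformer
    loss.backward()
```

The key point is that inside the block the attention op itself is replaced, so whatever mask the model builds never reaches the ring implementation.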

I suspect that the quality degradation is caused by the sliding-window attention mask not being applied during CP execution.
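To make the suspected mismatch concrete: a sliding-window model normally passes a banded causal mask to SDPA, while (if I read _attention.py correctly) the CP ring-attention path only honors is_causal and cannot take an explicit attn_mask. A toy comparison, with made-up shapes and window size:

```python
import torch
import torch.nn.functional as F

seq_len, window = 8, 4
i = torch.arange(seq_len)
# Banded causal mask: query position q attends to keys in (q - window, q].
sliding_window_mask = (i[:, None] >= i[None, :]) & (i[:, None] - i[None, :] < window)

q = k = v = torch.randn(1, 2, seq_len, 16)  # (batch, heads, seq, head_dim)

# What the model intends: sliding-window attention via an explicit mask.
out_windowed = F.scaled_dot_product_attention(q, k, v, attn_mask=sliding_window_mask)

# What effectively runs under CP if the mask is dropped: plain causal attention.
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Diverges once the causal prefix exceeds the window.
print(torch.allclose(out_windowed, out_causal))  # -> False
```

For any context longer than the window, the two outputs diverge, which would be consistent with the quality drop I am seeing.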

Could you share whether:

  1. attention mask support for context_parallel is planned, and

  2. there’s any ETA or related PR for this feature?

Thanks!