Supporting communication hooks with the HYBRID_SHARD strategy in torch FSDP

Hey all,

It seems that communication hooks are not compatible with the HYBRID_SHARD strategy in FSDP.

Does anybody know if there is a plan to support this in the future, or if there is a fundamental blocker to it?
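
For reference, here is the kind of registration I mean. This is just a minimal sketch using the built-in fp16 hook from the (private) torch.distributed.algorithms._comm_hooks module; it works with FULL_SHARD, but as far as I can tell the same call is rejected once the strategy is HYBRID_SHARD:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms._comm_hooks import default_hooks
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(
    nn.Linear(1024, 1024).cuda(),
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # hook works here
    # sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # ...but not here
)

# Built-in low-precision hook: cast grads to fp16 around the reduce-scatter.
state = default_hooks.LowPrecisionState(process_group=model.process_group)
model.register_comm_hook(state, default_hooks.fp16_compress_hook)
```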

Thanks for the great work.

We do not have immediate plans to support this, but I would be curious to learn more about your use case :slight_smile:

I see. Basically, what I want is more flexibility around the communication. For instance, I want to do more advanced gradient compression (beyond the built-in fp16/bf16 hooks) or other custom reduction mechanisms.
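
To make it concrete, here is a rough sketch of the kind of custom hook I have in mind. It mirrors the (state, grad, output) signature of the built-in FSDP low-precision hooks; the fp16 cast is just a placeholder for a real compression scheme, and CompressionState is a hypothetical name:

```python
import torch
import torch.distributed as dist

class CompressionState:
    """Hypothetical hook state; just carries the process group."""
    def __init__(self, process_group):
        self.process_group = process_group

def compressed_reduce_scatter_hook(state, grad, output):
    """For sharded strategies, FSDP passes the unsharded gradient as `grad`
    and expects the reduced shard to be written into `output`."""
    group = state.process_group
    # "Compression": cast down before communicating. A real scheme
    # (int8 + per-block scales, top-k, ...) would replace these two casts.
    compressed = grad.to(torch.float16)
    reduced = torch.empty_like(output, dtype=torch.float16)
    dist.reduce_scatter_tensor(reduced, compressed, group=group)
    # Decompress and average.
    output.copy_(reduced.to(output.dtype))
    output.div_(dist.get_world_size(group))

# fsdp_model.register_comm_hook(
#     CompressionState(fsdp_model.process_group), compressed_reduce_scatter_hook
# )
```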

It does not have to be a communication hook, though; I am wondering what the best way is, in general, to change the behavior of the all-reduce in FSDP. Would FSDP2 be better suited for this?

Thanks in advance

We may be able to add a hook to FSDP2.

For your use case, are you mainly interested in customizing the all-reduce across replica groups?

I am mostly interested in quantizing the gradients for the all-reduce, and in running some all-reduce operations on only a subset of the total world size (ideally using a device mesh).

I am implementing several local SGD algorithms, and I still want to leverage FSDP within one node while doing “normal” DDP-style synchronization across nodes. A bit like HYBRID_SHARD, but more granular.
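
Roughly, the setup I am after looks like this (a sketch; the mesh shape, dim names, and local_sgd_sync are all illustrative): shard with FSDP inside each node only, and periodically average parameters across nodes myself instead of having FSDP all-reduce gradients every step.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

dist.init_process_group("nccl")
num_nodes, gpus_per_node = 2, 8  # illustrative; world size must match
mesh = init_device_mesh(
    "cuda", (num_nodes, gpus_per_node), mesh_dim_names=("replicate", "shard")
)

# Shard only within a node; FSDP never communicates across nodes here.
model = FSDP(
    nn.Linear(1024, 1024).cuda(),
    process_group=mesh.get_group("shard"),
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)

replicate_group = mesh.get_group("replicate")

@torch.no_grad()
def local_sgd_sync(module):
    # Every node uses the same sharding layout, so averaging each rank's
    # (flat) parameter shard across the replicate dimension lines up.
    for p in module.parameters():
        dist.all_reduce(p, group=replicate_group)
        p.div_(dist.get_world_size(replicate_group))

# Training loop: run H purely local steps per node, then call
# local_sgd_sync(model) to average the replicas (plain local SGD).
```

With a 2D mesh like this, HYBRID_SHARD would all-reduce gradients across the replicate dimension on every step; the whole point of local SGD is to replace that with the periodic parameter sync above.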