Supporting communication hooks with the HYBRID_SHARD strategy in torch FSDP

Hey all,

It seems that communication hooks are not compatible with the HYBRID_SHARD strategy in FSDP.

Does anybody know if there is a plan to support this in the future, or if there is a fundamental blocker to it?
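
For reference, here is the kind of registration I mean. This is just a minimal sketch using the built-in fp16 hook from the (private) torch.distributed.algorithms._comm_hooks module; it works with FULL_SHARD, but as far as I can tell the same call is rejected once the strategy is HYBRID_SHARD:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms._comm_hooks import default_hooks
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(
    nn.Linear(1024, 1024).cuda(),
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # hook works here
    # sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # ...but not here
)

# Built-in low-precision hook: cast grads to fp16 around the reduce-scatter.
state = default_hooks.LowPrecisionState(process_group=model.process_group)
model.register_comm_hook(state, default_hooks.fp16_compress_hook)
```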

Thanks for the great work.

We do not have immediate plans to support this, but I would be curious to learn more about your use case :slight_smile:

I see. Basically, what I want is more flexibility around the communication. For instance, I want to do more advanced gradient compression (beyond the built-in fp16/bf16 hooks) or other custom reduction mechanisms.
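
To make it concrete, here is a rough sketch of the kind of custom hook I have in mind. It mirrors the (state, grad, output) signature of the built-in FSDP low-precision hooks; the fp16 cast is just a placeholder for a real compression scheme, and CompressionState is a hypothetical name:

```python
import torch
import torch.distributed as dist

class CompressionState:
    """Hypothetical hook state; just carries the process group."""
    def __init__(self, process_group):
        self.process_group = process_group

def compressed_reduce_scatter_hook(state, grad, output):
    """For sharded strategies, FSDP passes the unsharded gradient as `grad`
    and expects the reduced shard to be written into `output`."""
    group = state.process_group
    # "Compression": cast down before communicating. A real scheme
    # (int8 + per-block scales, top-k, ...) would replace these two casts.
    compressed = grad.to(torch.float16)
    reduced = torch.empty_like(output, dtype=torch.float16)
    dist.reduce_scatter_tensor(reduced, compressed, group=group)
    # Decompress and average.
    output.copy_(reduced.to(output.dtype))
    output.div_(dist.get_world_size(group))

# fsdp_model.register_comm_hook(
#     CompressionState(fsdp_model.process_group), compressed_reduce_scatter_hook
# )
```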

It does not have to be a communication hook, though; I am wondering what the best way is, in general, to change the behavior of the all-reduce in FSDP. Would FSDP2 be better suited for this?

Thanks in advance

We may be able to add a hook to FSDP2.

For your use case, are you mainly interested in customizing the all-reduce across replica groups?

I am mostly interested in quantizing the gradients for the all-reduce, and in running some all-reduce operations on only a subset of the total world size (ideally using a device mesh).

I am implementing several local SGD algorithms, and I still want to leverage FSDP within one node while doing “normal” DDP-style synchronization across nodes. A bit like HYBRID_SHARD, but more granular.
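
Roughly, the setup I am after looks like this (a sketch; the mesh shape, dim names, and local_sgd_sync are all illustrative): shard with FSDP inside each node only, and periodically average parameters across nodes myself instead of having FSDP all-reduce gradients every step.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

dist.init_process_group("nccl")
num_nodes, gpus_per_node = 2, 8  # illustrative; world size must match
mesh = init_device_mesh(
    "cuda", (num_nodes, gpus_per_node), mesh_dim_names=("replicate", "shard")
)

# Shard only within a node; FSDP never communicates across nodes here.
model = FSDP(
    nn.Linear(1024, 1024).cuda(),
    process_group=mesh.get_group("shard"),
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)

replicate_group = mesh.get_group("replicate")

@torch.no_grad()
def local_sgd_sync(module):
    # Every node uses the same sharding layout, so averaging each rank's
    # (flat) parameter shard across the replicate dimension lines up.
    for p in module.parameters():
        dist.all_reduce(p, group=replicate_group)
        p.div_(dist.get_world_size(replicate_group))

# Training loop: run H purely local steps per node, then call
# local_sgd_sync(model) to average the replicas (plain local SGD).
```

With a 2D mesh like this, HYBRID_SHARD would all-reduce gradients across the replicate dimension on every step; the whole point of local SGD is to replace that with the periodic parameter sync above.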