Abstract
We are exploring the idea of overlapping AllGather and ReduceScatter in FSDP’s backward pass. This approach can reduce total communication time by up to half, especially in environments that support in-network computing (e.g., NVIDIA SHARP). It could significantly speed up FSDP workflows where communication is a bottleneck.
The basic concept is also discussed in *Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI*, presented at SC24.
Background
During the FSDP backward pass, the following collectives typically run in sequence (a minimal sketch follows the list):
- AllGather for parameter unsharding
- ReduceScatter for gradient aggregation
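For concreteness, here is a minimal sketch of how these two collectives might be issued with `torch.distributed`. This is not FSDP's actual internal code; the function name, tensor names, and sizes are illustrative only, and `torch.distributed` is assumed to be initialized with the NCCL backend.

```python
# Minimal sketch of the two FSDP backward-pass collectives, issued sequentially.
import torch
import torch.distributed as dist

def backward_step_collectives(param_shard, grad_full, group=None):
    world_size = dist.get_world_size(group)

    # AllGather: unshard the parameters needed for this layer's backward.
    param_full = torch.empty(param_shard.numel() * world_size,
                             dtype=param_shard.dtype, device=param_shard.device)
    dist.all_gather_into_tensor(param_full, param_shard, group=group)

    # ... local gradient computation would happen here ...

    # ReduceScatter: each rank keeps only its shard of the summed gradient
    # (grad_full.numel() is assumed to be divisible by world_size).
    grad_shard = torch.empty(grad_full.numel() // world_size,
                             dtype=grad_full.dtype, device=grad_full.device)
    dist.reduce_scatter_tensor(grad_shard, grad_full,
                               op=dist.ReduceOp.SUM, group=group)
    return param_full, grad_shard
```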
By using in-network computing or multicast for these operations, the amount of data each rank receives or sends can be reduced compared with ring-based algorithms (a rough worked example follows the list):
- ReduceScatter with in-network computing: decreases the amount of data each rank receives, while keeping the amount sent the same
  - e.g., NVIDIA SHARP, Reduction Server
- AllGather with multicast: decreases the amount of data each rank sends, while keeping the amount received the same
  - e.g., NVIDIA SHARP, MVAPICH MPI
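As a rough worked example of this asymmetry, the sketch below tabulates idealized per-rank traffic for a message of total size M across N ranks, ignoring headers and topology effects; the ring figures are the usual (N-1)/N·M per direction, and the in-network/multicast figures assume an idealized switch-assisted implementation.

```python
def per_rank_traffic(message_bytes: float, world_size: int) -> dict:
    """Approximate bytes each rank sends/receives for one collective over a
    message of total size `message_bytes` across `world_size` ranks."""
    n, m = world_size, message_bytes
    ring = (n - 1) / n * m  # ring algorithms send and receive this much per rank

    return {
        # Ring baseline: both collectives are symmetric in send and receive.
        "ring_allgather":          {"send": ring,  "recv": ring},
        "ring_reducescatter":      {"send": ring,  "recv": ring},
        # In-network ReduceScatter (e.g., SHARP): send everything, receive only the shard.
        "innetwork_reducescatter": {"send": m,     "recv": m / n},
        # Multicast AllGather: send only the local shard, receive everything else.
        "multicast_allgather":     {"send": m / n, "recv": ring},
    }
```

With these numbers, one collective is send-heavy and the other receive-heavy, which is what makes overlapping them attractive on full-duplex links.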
If the two collectives are executed in an overlapping manner, the send-heavy ReduceScatter and the receive-heavy AllGather can keep both directions of a full-duplex link busy at the same time; their sending and receiving times are roughly equalized, and the overall communication time is shortened.
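A minimal sketch of what the overlap might look like with non-blocking collectives on two separate process groups is shown below. The groups `ag_group`/`rs_group` are hypothetical (e.g., created with `dist.new_group`), and whether the backend actually runs the two operations concurrently, and whether SHARP/multicast is used, depends on the NCCL configuration.

```python
import torch
import torch.distributed as dist

def overlapped_backward_collectives(param_shard, grad_full, ag_group, rs_group):
    """Issue AllGather and ReduceScatter as non-blocking collectives so they
    can (potentially) proceed concurrently. In FSDP the two would typically
    belong to different layers/buckets of the backward pass."""
    world_size = dist.get_world_size(ag_group)

    param_full = torch.empty(param_shard.numel() * world_size,
                             dtype=param_shard.dtype, device=param_shard.device)
    grad_shard = torch.empty(grad_full.numel() // world_size,
                             dtype=grad_full.dtype, device=grad_full.device)

    # async_op=True returns immediately with a Work handle for each collective.
    ag_work = dist.all_gather_into_tensor(param_full, param_shard,
                                          group=ag_group, async_op=True)
    rs_work = dist.reduce_scatter_tensor(grad_shard, grad_full,
                                         op=dist.ReduceOp.SUM,
                                         group=rs_group, async_op=True)

    # Wait on both handles before the results are consumed.
    ag_work.wait()
    rs_work.wait()
    return param_full, grad_shard
```

Using two process groups gives two NCCL communicators, which is one common way to let the collectives be scheduled independently; even then, the kernels may still contend for SMs and bandwidth, which is part of the performance question below.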
In-network computing solutions (e.g., SHARP) have become widely available and increasingly cost-effective, motivating further exploration of this approach.
Questions
- Technical feasibility: Are there any major obstacles to running these collective communications in parallel? For example, do we need separate CUDA streams or non-blocking collective calls to make this work?
- Performance considerations: Do you think this overlapping strategy would yield substantial benefits in practice, or would overhead (e.g., synchronization or stream contention) limit the gains?
Any insights, shared experiences, or best practices would be greatly appreciated. Thank you.