Could you confirm whether PyTorch FSDP (from torch.distributed.fsdp import FullyShardedDataParallel as FSDP) is compatible with Apex’s Tensor and Sequence Parallelism? The goal is to use the Column and Row Linear layers with Sequence Parallelism enabled inside a model wrapped with PyTorch FSDP.
Conceptually, I’m not aware of any fundamental incompatibilities between FSDP and tensor/sequence parallelism. That said, given the sheer complexity and additional communication requirements of FSDP, I’d be skeptical that combining the two gives a performance benefit. Have you observed any issues (compatibility, performance, or convergence)?
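For reference, here is a minimal sketch of how ranks might be partitioned when the two are combined, so that FSDP shards only across data-parallel replicas and not across tensor-parallel ranks. It assumes no pipeline parallelism and a Megatron-style layout (contiguous tensor-parallel ranks, strided data-parallel ranks). The helper name `build_groups` is hypothetical and is not part of Apex or PyTorch.

```python
def build_groups(world_size: int, tp_size: int):
    """Return (tp_groups, dp_groups) as lists of rank lists.

    Assumption: no pipeline parallelism, Megatron-style layout where
    tensor-parallel ranks are contiguous and data-parallel ranks are
    strided by tp_size. This is an illustration, not a library API.
    """
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    dp_size = world_size // tp_size
    # Each tensor-parallel group holds the shards of one model replica.
    tp_groups = [list(range(i * tp_size, (i + 1) * tp_size)) for i in range(dp_size)]
    # Each data-parallel group holds the ranks that own the same TP shard.
    dp_groups = [list(range(j, world_size, tp_size)) for j in range(tp_size)]
    return tp_groups, dp_groups


if __name__ == "__main__":
    tp_groups, dp_groups = build_groups(world_size=8, tp_size=2)
    print("TP groups:", tp_groups)
    print("FSDP (data-parallel) groups:", dp_groups)
```

Under these assumptions, each rank would create its data-parallel group with torch.distributed.new_group and pass it to FSDP through the process_group argument, leaving the tensor-parallel group to the Apex layers. Whether the Apex layers behave correctly under that wrapping is exactly the open question here.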
Thank you for the response.
I am testing a few models of sizes 150M, 410M, 1B, and 3B with TP, SP, and Flash on C4, and I plan to scale to larger models afterward. Before starting the next runs with FSDP, I wanted to check whether there were any known incompatibilities. I will document the training runs with FSDP + TP + SP + Flash and share all of the hyperparameters and results once they finish.