I see that the latest PyTorch distributed paradigms are moving toward combining Data Parallel + Tensor Parallel + Pipeline Parallel, making training “N-D parallel”. But are there no major updates on improving the APIs (the RPC framework) for the “Parameter Server strategy”?
In PS Strategy:
- If a model doesn’t fit in memory, I can shard the parameters across several parameter servers and have workers fetch only the parameters they need.
- If a worker fails, it doesn’t stop all my other workers (which torchrun does for synchronous training). Of course, simply put, this tolerance comes at some cost in performance.
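To make the two points above concrete, here is a minimal sketch in plain Python (not the actual `torch.distributed.rpc` API) of the idea: parameters are sharded across several “parameter server” objects, each worker pulls only the shards it needs, and gradient pushes update a shard without any barrier across workers. The class names and the round-robin sharding scheme are my own illustration, not PyTorch code.

```python
class ParameterServer:
    """Holds one shard of the model's parameters (illustrative only)."""

    def __init__(self, params):
        self.params = dict(params)  # name -> value

    def pull(self, names):
        # A worker fetches only the parameters it needs from this shard.
        return {n: self.params[n] for n in names}

    def push(self, grads, lr=0.1):
        # Apply an SGD-style update; no synchronization barrier with
        # other workers, so a failed worker simply stops contributing.
        for n, g in grads.items():
            self.params[n] -= lr * g


def shard_parameters(params, n_servers):
    """Round-robin assignment of parameter names to servers."""
    shards = [{} for _ in range(n_servers)]
    for i, (name, value) in enumerate(sorted(params.items())):
        shards[i % n_servers][name] = value
    return [ParameterServer(s) for s in shards]


# Usage: four parameters sharded over two servers; a worker pulls one
# parameter it needs, computes a (dummy) gradient, and pushes it back.
servers = shard_parameters({"w1": 1.0, "w2": 2.0, "b1": 0.5, "b2": 0.25}, 2)
pulled = servers[0].pull(["b1"])
servers[0].push({"b1": 1.0}, lr=0.1)
```

In the real RPC-based version, `pull`/`push` would be remote calls (e.g. `rpc.rpc_sync` to a server rank) rather than local method calls, which is exactly where the communication bottlenecks show up.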
But isn’t this a great solution? Why not invest in this API and improve its communication bottlenecks?