I see that the latest PyTorch distributed paradigms are moving toward combining Data Parallel + Tensor Parallel + Pipeline Parallel, making training “N-D parallel”. But are there no major updates on improving the APIs (the RPC framework) for the “Parameter Server strategy”?
In PS Strategy:
- If a model doesn’t fit in memory, I can shard the parameters across several parameter servers and have workers fetch only the parameters they need.
- If a worker fails, it doesn’t stop all my other workers (which torchrun does for synchronous training). Of course, simply put, this tolerance comes at some cost in performance.
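To make the two points above concrete, here is a minimal sketch in plain Python (not the actual `torch.distributed.rpc` API) of the idea: parameters are sharded across several “parameter server” objects, each worker pulls only the shards it needs, and gradient pushes update a shard without any barrier across workers. The class names and the round-robin sharding scheme are my own illustration, not PyTorch code.

```python
class ParameterServer:
    """Holds one shard of the model's parameters (illustrative only)."""

    def __init__(self, params):
        self.params = dict(params)  # name -> value

    def pull(self, names):
        # A worker fetches only the parameters it needs from this shard.
        return {n: self.params[n] for n in names}

    def push(self, grads, lr=0.1):
        # Apply an SGD-style update; no synchronization barrier with
        # other workers, so a failed worker simply stops contributing.
        for n, g in grads.items():
            self.params[n] -= lr * g


def shard_parameters(params, n_servers):
    """Round-robin assignment of parameter names to servers."""
    shards = [{} for _ in range(n_servers)]
    for i, (name, value) in enumerate(sorted(params.items())):
        shards[i % n_servers][name] = value
    return [ParameterServer(s) for s in shards]


# Usage: four parameters sharded over two servers; a worker pulls one
# parameter it needs, computes a (dummy) gradient, and pushes it back.
servers = shard_parameters({"w1": 1.0, "w2": 2.0, "b1": 0.5, "b2": 0.25}, 2)
pulled = servers[0].pull(["b1"])
servers[0].push({"b1": 1.0}, lr=0.1)
```

In the real RPC-based version, `pull`/`push` would be remote calls (e.g. `rpc.rpc_sync` to a server rank) rather than local method calls, which is exactly where the communication bottlenecks show up.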
But isn’t this a great solution? Why not invest in this API and improve its communication bottlenecks?