I am trying to deploy a crash-resilient distributed deployment in PyTorch. Assume that node of rank 0 controls the learning process. Now, if this node fails/crashes, I want node 1 to continue the job of node 0 and somehow notifies other nodes of the change. One way to do this is to assign the rank 0 to node 1 so that all nodes can communicate directly as usual with node 0 (which was node 1 before the change). Is there any way to do this?
Hey @aguirguis, if you are looking for elastic training for distributed data parallel, torchelastic is built for this purpose. It will conduct re-rendezvous on living nodes when failure occurs.
thanks @mrshenli. @aguirguis, here’s the quickstart guide for TorchElastic (http://pytorch.org/elastic/0.2.0rc0/quickstart.html). If you are familiar with
torch.distributed.launch things should look familiar. We’ve also written a kubernetes controller in collaboration with EKS, which you can check out here: http://pytorch.org/elastic/0.2.0rc0/kubernetes.html
Thank you @mrshenli and @Kiuk_Chung for your responses; they are really helpful.
I have two follow-up questions:
- According to this design, I have to run an etcd server, right? this is still a single point of failure. Is there any way to circumvent that?
- Is there any way to force some specific values for ranks to some specific nodes?
Thanks a lot.
Thanks @Kiuk_Chung for your answers. I will check the etcd FAQ.
I’m thinking of a parameter server deployment with a crash tolerance to the central server. I want to deploy, let’s say, 2 servers so that if one crashes, the other one takes over. Yet, I want this to be transparent to the workers, i.e., they still send their gradients normally to the process with rank 0.
Is there an easy and cheap way to do this currently?
You could do this using torch rpc. Say your parameter servers follow some naming convention like “ps:##” and you know the total ps replicas. Then you could round robin or fall back to the surviving parameter servers if an rpc call fails.
How do you plan on keeping the data in the parameter servers replicated so that you are tolerant to ps failures?
Yes, I was actually thinking of using rpc. My only concern is about its performance, compared to the other collective primitives (e.g., gather, broadcast…etc.). Can you comment on this performance comparison?
The problem of consistency among PSes could be solved using checkpoints. Assume there is one primary PS, I’d let this PS only to update the model and periodically checkpoint it to some persistent database. If it crashes, the backup PS first loads the latest checkpoint and then continues training normally. How does that sound?