[Elastic Distributed Training] Will the master node be reselected and restarted if the master node fails?

logicShu · September 13, 2021, 3:57am

When elastic training uses c10d as the backend store and the master node fails, will the program restart?

Will elastic choose a new master node if I use etcd?

pritamdamania87 · September 14, 2021, 1:39am

@cbalioglu Do you know what the expected behavior is here?

cbalioglu · September 14, 2021, 2:20pm

As of today we do not have a built-in failover mechanism, so if the master node fails, the training will terminate.

cbalioglu · September 14, 2021, 7:39pm

Will elastic choose a new master node if I use etcd?

Sorry, I missed the second question. Yes, if you use etcd and the worker on the master node fails, the agents will try to establish another round of rendezvous (up to the limit set by the --max-restarts option you specified).

In summary, c10d is ideal if you don’t want to deal with installing and running a third-party dependency; etcd is ideal if you care more about fault tolerance and failover.
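
To make the comparison concrete, here is a minimal sketch of how the two launches might differ. It only builds the `torchrun` argument lists and does not start anything. The host names, ports, job id, and `train.py` are placeholders, and `build_torchrun_args` is a hypothetical helper, not part of PyTorch. The flags use the underscore spelling (`--max_restarts`, `--rdzv_backend`), so check them against your PyTorch version.

```python
from typing import List


def build_torchrun_args(
    backend: str,
    endpoint: str,
    job_id: str = "example-job",   # placeholder rendezvous id
    nnodes: str = "1:4",           # elastic range: min 1, max 4 nodes
    nproc_per_node: int = 8,
    max_restarts: int = 3,
    script: str = "train.py",      # placeholder training script
) -> List[str]:
    """Build a torchrun command line for the given rendezvous backend."""
    if backend not in ("c10d", "etcd", "etcd-v2"):
        raise ValueError(f"unsupported rendezvous backend: {backend}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--max_restarts={max_restarts}",
        f"--rdzv_id={job_id}",
        f"--rdzv_backend={backend}",
        f"--rdzv_endpoint={endpoint}",
        script,
    ]


# c10d: the endpoint is one of the training nodes, which hosts the store.
c10d_args = build_torchrun_args("c10d", "node0.example.com:29400")

# etcd: the endpoint is an etcd server running outside the training nodes.
etcd_args = build_torchrun_args("etcd", "etcd.example.com:2379")

if __name__ == "__main__":
    print(" ".join(c10d_args))
    print(" ".join(etcd_args))
```

The only difference between the two commands is the backend name and where the endpoint points: with c10d it is a node that takes part in the job, with etcd it is a separate service.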

logicShu · September 15, 2021, 3:15am

Thank you very much for your reply!

After reading the source code, I understood some of the execution mechanisms. Your reply confirms that etcd is the better choice for me.

I will deploy the etcd server on a stable CPU machine, so that I can dynamically add or remove nodes without worrying about whether the master node fails, as long as the etcd server does not fail.

logicShu · September 15, 2021, 10:36am

I am curious about one more thing.

If I use c10d to run elastic training and set --nproc_per_node to zero on a CPU machine, would that achieve something similar to the etcd backend?

cbalioglu · September 20, 2021, 3:59pm

Unfortunately not. You will still have a single point of failure even if the c10d store runs on a separate host. If that host fails, the whole job fails. By the way, you would have the same problem if you had a single etcd instance. The advantage you have with etcd is that you can set up a small cluster of etcd servers (3 or more machines, since etcd needs a majority of its members to stay up) for failover handling.
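
As a rough illustration of why the size of the etcd cluster matters: etcd relies on a majority (quorum) of its members being reachable, so the number of failures it can survive follows from simple arithmetic. The helper names below are made up for this sketch.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} member(s): quorum={quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

A single instance and a two-member cluster both tolerate zero failures, while three members tolerate one and five tolerate two.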

logicShu · September 22, 2021, 2:21am

Oh, I see. The most important thing about etcd is that it is a distributed, reliable key-value store. Running only a single etcd instance defeats that purpose.

cbalioglu · September 22, 2021, 3:53pm

Yep, that is correct.

logicShu · September 23, 2021, 3:38am

Thank you for your patient reply! It’s a pleasure to communicate with you. I hope to discuss with you more in the future. Thank you again!

cbalioglu · September 23, 2021, 1:01pm

Thanks for the kind words! You are welcome!

zzibc · September 5, 2022, 4:22pm

Hi, @cbalioglu!
If I can ensure that my CPU machine will not fail, can I use c10d as the backend and set --nproc_per_node=0 on the CPU machine to get behavior similar to the etcd backend? Will the agents try to establish another round of rendezvous when the master node fails?