[Elastic Distributed Training] Will the master node be reselected and restarted if the master node fails?

When elastic training uses c10d as the backend store and the master node fails, will the program restart?

Will elastic choose a new master node if I use etcd?

@cbalioglu Was wondering if you knew what is the expected behavior here?

As of today we do not have a built-in failover mechanism, so if the master node fails, the training will terminate.

Will elastic choose a new master node if I use etcd?

Sorry, I missed the second question. Yes, if you use etcd and the worker on the master node fails, the agents will try to establish another round of rendezvous (up to the number of restarts you set with the --max-restarts option).

In summary, c10d is ideal if you don’t want to deal with installing and running a third-party dependency; etcd is ideal if you care more about fault tolerance and failover.
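To make the two options concrete, here is a minimal sketch that assembles the torchrun command line for each rendezvous backend. The helper `build_torchrun_cmd`, the host names, the job id, and the script name `train.py` are all hypothetical placeholders; only the torchrun flags themselves are real (the underscore spelling `--max_restarts` is used here).

```python
from typing import List


def build_torchrun_cmd(backend: str, endpoint: str, script: str = "train.py",
                       nnodes: str = "1:4", nproc_per_node: int = 8,
                       max_restarts: int = 3, job_id: str = "job42") -> List[str]:
    """Assemble a torchrun command line for the given rendezvous backend.

    This helper is illustrative only; it is not part of PyTorch.
    """
    if backend not in ("c10d", "etcd"):
        raise ValueError(f"unsupported backend for this sketch: {backend}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # elastic range: min:max nodes
        f"--nproc_per_node={nproc_per_node}",
        f"--max_restarts={max_restarts}",      # upper bound on re-rendezvous rounds
        f"--rdzv_id={job_id}",
        f"--rdzv_backend={backend}",
        f"--rdzv_endpoint={endpoint}",
        script,
    ]


# c10d: the store is hosted by one of the training nodes (hypothetical host).
c10d_cmd = build_torchrun_cmd("c10d", "node0.example.com:29400")
# etcd: the store is an external etcd server (hypothetical host).
etcd_cmd = build_torchrun_cmd("etcd", "etcd.example.com:2379")

print(" ".join(c10d_cmd))
print(" ".join(etcd_cmd))
```

The only difference between the two commands is where the rendezvous store lives: with c10d it goes down together with the node hosting it, while with etcd it sits outside the training nodes.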

Thank you very much for your reply!

After reading the source code, I understood some of the execution mechanisms. Your reply confirms that etcd is the better choice for me.

I will deploy an etcd server on a stable CPU machine, so that I can dynamically add or remove nodes without worrying about whether the master node fails, as long as the etcd server does not fail.

I am curious about one more thing.

If I use c10d to run elastic training and set --nproc_per_node to zero on a CPU machine, can I achieve something similar to the etcd backend?

Unfortunately not. You will still have a single point of failure even if the c10d store runs on a separate host. If that host fails, the whole job fails. By the way, you would have the same problem if you had a single etcd instance. The advantage you have with etcd is that you can set up a small cluster of etcd servers for failover handling (in practice 3 or more machines, since etcd needs a majority of its members alive to keep working).
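As a rough illustration of why the size of the etcd cluster matters, the sketch below computes how many member failures a cluster can survive under the general majority-quorum rule that etcd follows. The function name `etcd_failure_tolerance` is hypothetical, and nothing here is specific to torch elastic.

```python
def etcd_failure_tolerance(members: int) -> int:
    """Return how many members can fail while a majority quorum survives."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    quorum = members // 2 + 1  # majority needed to keep serving requests
    return members - quorum


for n in (1, 2, 3, 5):
    print(f"{n} member(s): tolerates {etcd_failure_tolerance(n)} failure(s)")
```

Under this rule a single instance and a two-member cluster both tolerate zero failures, which is why odd cluster sizes such as three or five are the usual choice.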

Oh, I see. The key benefit of etcd is that it is a distributed, reliable key-value store, so running only a single etcd instance defeats the purpose.

Yep, that is correct.

Thank you for your patient reply! It’s a pleasure to communicate with you. I hope to discuss with you more in the future. Thank you again!

Thanks for the kind words! You are welcome!

Hi @cbalioglu!
If I can ensure that my CPU machine will not fail, can I use c10d as the backend and set --nproc_per_node=0 on the CPU machine to get behavior similar to the etcd backend? Will the agents try to establish another round of rendezvous when the master node fails?