When elastic training uses c10d as the backend store and the master node fails, will the program restart? Will elastic choose a new master node if I use etcd?
@cbalioglu, do you know what the expected behavior is here?
As of today we do not have a built-in failover mechanism, so if the master node fails, the training will terminate.
Will elastic choose a new master node if I use etcd?
Sorry, I missed the second question. Yes, if you use etcd, then if the worker on the master node fails, the agents will try to establish another round of rendezvous (up to the --max-restarts option you specified).
In summary, c10d is ideal if you don't want to deal with installing and running a third-party dependency; etcd is ideal if you care more about fault tolerance and failover.
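To make the two options above concrete, here is a minimal sketch that builds the torchrun command line for either rendezvous backend. The helper name build_torchrun_command, the host names, the node range, the job id, and the script name are all illustrative assumptions, not values from this thread. The flags are written in their underscore form; adjust them to whatever your torchrun version accepts.

```python
from typing import List


def build_torchrun_command(
    backend: str,
    endpoint: str,
    *,
    nnodes: str = "1:4",          # hypothetical elastic range (min:max nodes)
    nproc_per_node: int = 8,      # hypothetical worker count per node
    max_restarts: int = 3,        # upper bound on re-rendezvous attempts
    run_id: str = "example_job",  # hypothetical job id shared by all agents
    script: str = "train.py",     # hypothetical training script
) -> List[str]:
    """Return the argument list for launching one elastic agent."""
    if backend not in ("c10d", "etcd"):
        raise ValueError(f"unsupported rendezvous backend: {backend!r}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--max_restarts={max_restarts}",
        f"--rdzv_id={run_id}",
        f"--rdzv_backend={backend}",
        f"--rdzv_endpoint={endpoint}",
        script,
    ]


# c10d: the endpoint is one of the training nodes, which hosts the store.
c10d_cmd = build_torchrun_command("c10d", "node0.example.com:29400")

# etcd: the endpoint is an external etcd server, separate from the trainers.
etcd_cmd = build_torchrun_command("etcd", "etcd.example.com:2379")

print(" ".join(c10d_cmd))
print(" ".join(etcd_cmd))
```

The only difference between the two commands is the backend and where the endpoint points: with c10d the store lives on a node that takes part in the job, while with etcd it lives on a service you run yourself.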
Thank you very much for your reply!
After reading the source code, I understood some of the execution mechanisms. Your reply confirms that etcd is a better choice for me.
I will deploy the etcd server on a stable CPU machine, so that I can dynamically add or remove nodes without worrying about whether the master node fails, as long as the etcd server does not fail.
I am curious about one more thing.
If I use c10d to run elastic training and set --nproc_per_node to zero on a CPU machine, would that achieve something similar to the etcd backend?
Unfortunately not. You will still have a single point of failure even if the c10d store runs on a separate host. If that host fails, you would end up with a failure of the whole job. By the way you would have the same problem if you had a single etcd instance. The advantage you have with etcd is that you can set up a small cluster (2 or more machines) of etcd servers for failover handling.
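As a rough illustration of why the number of etcd servers matters, the sketch below computes how many member failures a cluster can survive. It assumes the majority-quorum rule that etcd's consensus protocol uses; the function name is hypothetical. Under that rule a single instance tolerates no failures, and a two-member cluster does not either, so three members is the usual minimum when failover is the goal.

```python
def tolerated_failures(members: int) -> int:
    """Number of etcd members that can fail while a majority quorum survives."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    quorum = members // 2 + 1  # strict majority of the cluster
    return members - quorum


for size in (1, 2, 3, 5):
    print(f"{size} member(s): tolerates {tolerated_failures(size)} failure(s)")
```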
Oh, I see. The most important thing about etcd is its distributed reliable key-value store. When using etcd, it is meaningless to have only one etcd instance.
Yep, that is correct.
Thank you for your patient reply! It’s a pleasure to communicate with you. I hope to discuss with you more in the future. Thank you again!
Thanks for the kind words! You are welcome!
Hi, @cbalioglu!
If I can ensure that my CPU machine will not fail, can I use c10d as the backend and set --nproc_per_node=0 on the CPU machine to achieve something similar to the etcd backend? Will the agents try to establish another round of rendezvous when the master node fails?